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support  a  variety  of  models.  Often  details  of  an  implementation  become  visible  in  the 
programming  environment,  restricting  future  implementation  improvements  because 
of  compatibility.  The  lack  of  an  interface  between  model  and  machine  issues  also  com¬ 
plicates  the  translation  of  a  machine  improvement  into  a  performance  improvement 
for  model  applications. 

This  thesis  defines  Pi,  a  parallel  architecture  interface  that  separates  model  and  ma¬ 
chine  issues,  allowing  them  to  be  addressed  independently.  This  provides  greater 
flexibility  for  both  the  model  and  machine  builder.  Pi  addresses  a  set  of  common  par¬ 
allel  model  requirements  including  low  latency  communication,  fast  teisk  switching, 
low  cost  synchronization,  efficient  storage  management,  the  ability  to  exploit  locality, 
and  efficient  support  for  sequential  code.  Since  Pi  provides  generic  parallel  opera¬ 
tions,  it  can  efficiently  support  many  parallel  programming  models  including  hybrids 
of  existing  models.  Pi  also  forms  a  basis  of  comparison  for  architectural  components. 

Pi  is  evaluated  in  two  ways.  First,  several  mechanisms  required  by  existing  parallel 
models  are  constructed  on  the  interface.  These  examples  are  executed  and  evalu¬ 
ated  using  a  Pi  simulation  environment.  Then  a  machine  substrate  that  supports  the 
interface  is  specified.  It  is  designed  to  efficiently  support  the  generic  parallel  opera¬ 
tions  provided  in  Pi.  The  feasibility  of  gate  array  implementation  is  considered.  The 
effectiveness  of  this  machine  substrate  at  supporting  Pi  is  evaluated.  The  role  of  a 
machine  compiler  below  the  interface  is  also  discussed. 

This  thesis  demonstrates  a  parallel  architecture  interface.  Pi  efficiently  supports 
several  model  mechanisms.  A  machine  substrate  which  effectively  implements  Pi  is 
presented. 
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Abstract 


I 

Existing  parallel  architectures  are  constructed  monolithically,  with  no  well  defined 
boundaries  separating  model  and  machine  issues.  This  makes  it  difficult  to  evaluate 
the  effect  of  a  single  component  of  an  architecture,  or  compare  it  with  the  correspond¬ 
ing  components  of  other  architectures.  Machine  hardware  is  specialized  to  support  a 
single  programming  model,  even  though  similar  but  more  general  mechanisms  could 
support  a  variety  of  models.  Often  details  of  an  implementation  become  visible  in  the 
programming  environment,  restricting  future  implementation  improvements  because 
of  compatibility.  The  lack  of  an  interface  between  model  and  machine  issues  also  com¬ 
plicates  the  translation  of  a  machine  improvement  into  a  performance  improvement 
for  model  applications. ' ^ 

This  thesis  defines  Pi,  a  parallel  architecture  interface  that  separates  model  and  ma¬ 
chine  issues,  allowing  them  to  be  addressed  independently.  This  provides  greater 
flexibility  for  both  the  model  and  machine  builder.  Pi  addresses  a  set  of  common  par¬ 
allel  model  requirements  including  low  latency  communication,  fast  task  switching, 
low  cost  synchronization,  efficient  storage  management,  the  ability  to  exploit  locality, 
and  efficient  support  for  sequential  code.  Since  Pi  provides  generic  parallel  opera¬ 
tions,  it  can  efficiently  support  many  parallel  programming  models  including  hybrids 
of  existing  models.  Pi  also  forms  a  basis  of  comparison  for  architectural  components. 


Pi  is  evaluated  in  two  ways.  First,  several  mechanisms  required  by  existing  parallel 
models  are  constructed  on  the  interface.  These  examples  are  executed  and  evalu¬ 
ated  using  a  Pi  simulation  environment.  Then  a  machine  substrate  that  supports  the 
interface  is  specified.  It  is  designed  to  efficiently  support  the  generic  parallel  opera¬ 
tions  provided  in  Pi.  The  feasibility  of  gate  array  implementation  is  considered.  The 
effectiveness  of  this  machine  substrate  at  supporting  Pi  is  evaluated.  The  role  of  a 
machine  compiler  below  the  interface  is  also  discussed. 


This  thesis  demonstrates  a  parallel  architecture  interface.  Pi  efficiently  supports 
several  model  mechanisms.  A  machine  substrate  which  effectively  implements  Pi  is 
presented. 
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Chapter  1 
Introduction 


The  von  Neumann  model  of  sequential  computers  was  a  milestone  in  computer  evo¬ 
lution  because  it  defined  an  interface  between  programming  models  and  machine 
implemeatations.  Both  models  and  machines  have  advanced  considerably  since  then, 
but  the  interface,  the  way  each  side  st  ’s  the  other,  has  remained  relatively  unchanged. 
This  separation  of  concerns  is  responsible  in  part  for  a  key  capability  of  today’s  '•om- 
puters;  the  ability  to  support  a  variety  of  programming  languages  and  applications 
using  the  latest  machine  design  techniques  and  technology. 

Wanted:  A  Parallel  Interface 

A  similar  Interface  can  provide  the  same  beneficial  effects  for  parallel  computers.  This 
type  of  interface  is  sorely  needed.  Currently,  parallel  architectures  are  constructed 
monolithically,  as  shown  in  Figure  1.1.  An  architecture  is  a  composition  of  model 
and  machine  features  with  no  well-defined  boundaries  separating  different  asne'ts  of 
the  design.  This  makes  it  difficult  to  evaluab  the  effect  of  a  single  compoi.jnt,  or 
compare  it  with  the  corresponding  components  of  other  architectures. 

Sometimes  details  of  an  implementation  become  visible  in  the  programming  environ¬ 
ment.  restricting  future  implementation  improvements  because  of  compatibility.  The 
lack  of  an  interface  between  model  and  machine  issues  also  complicates  the  trans¬ 
lation  of  a  machine  improvement  (e.g.,  an  improved  communication  network)  into  a 
performance  improvement  for  model  applications. 

Despite  the  benefit  of  a  parallel  interface,  it  has  not  been  possible  to  specify  one.  Ad¬ 
equate  experience  with  model  and  machine  issues  in  parallel  architectures  is  required. 
It  is  first  necessary  to  identify  what  parallel  architectural  features  and  mechanisms 
are  needed,  and  obtainable. 

Parallel  Architecture  Requirements 

Recently,  a  consensus  on  requirements  has  begun  to  surface  in  parallel  architecture 
research.  The  following  capabilities  have  been  identified: 
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I  machine  B 

organization  B 

network 

technology 

Figure  1.1:  The  Problem:  Architectural  Monoliths 

•  low  latency  communication 
•  fcLst  task  switching 
•  low  cost  synchronization 
•  “fficient  storage  management 
•  the  ability  to  exploit  all  types  of  locality 

•  the  ability  to  efficiently  support  sequential  code  segments  in  a  parallel  environ¬ 
ment 

These  requirements  are  common  to  most  existing  parallel  models.  With  them,  it  is 
now  possible  to  specify  an  interface. 


Introducing  Pi 

This  thesis  introduces  Pi,  a  parallel  architecture  interface.  Pi  is  based  on  the  generic 
parallel  model  requirements  described  above,  so  a  machine  substrate  that  supports 
Pi  can  implement  multiple  parallel  models.  Pi  provides  an  abstraction  that  separates 
model  and  machine  issues  allowing  them  to  be  addressed  independently.  It  forms 
a  basis  of  comparison  for  architectural  features  and  it  provides  greater  flexibility  for 
both  the  model  and  machine  builder. 

’  Model  issues  are  related  to  the  problem  solving  style.  Machine  issues  are  related  to  the  physical 
realization  of  a  problem  solving  system 
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Figure  1.2:  Pi  Overview 

Figure  1.2  presents  an  overview  of  Pi.  Each  box  above  the  interface  represents  the 
mechanisms  required  for  a  model,  constructed  with  Pi  operations.  For  example,  the 
data  parallelism  model  includes  a  mechanism  for  supporting  set  synchronization  (a 
synchronization  tree,  for  example).  In  addition  to  existing  parallel  models,  Pi  also 
supports  hybrids  of  mechanisms  from  different  models.  Below  the  interface,  each 
box  represents  a  different  Pi  machine  substrate.  All  of  these  machines  are  designed 
specifically  to  support  Pi  (i.e.,  they  are  not  specialized  for  a  particular  model).  The 
estimated  number  of  nodes  and  machine  cost  is  listed  for  each  substrate.  As  new 
machine  building  techniques  and  technologies  are  incorporated  into  future  substrates, 
the  performance  improvements  can  be  realized  by  all  models  above  the  interface. 

Pi  does  not  eliminate  the  interrelationship  of  models  and  machines.  The  costs  of  Pi 
operations  on  a  particular  machine  affect  how  model  mechanisms  are  best  constructed. 
They  also  have  an  influence  on  which  model  provides  the  most  effective  solution  to 
a  problem.  The  interface  attempts  to  provide  the  functionality  of  the  underlying 
machine  in  a  way  that  does  not  unnecessarily  bias  a  solution  towards  a  particular 
model.  An  advance  in  machine  design  may  lower  the  cost  of  an  interface  operation, 
requiring  changes  in  a  model’s  implementation  or  the  solution  to  a  problem.  But  since 
only  cost,  not  functionality,  has  changed,  a  major  reimplementation  is  not  expected. 

A  system  incorporating  an  architectural  interface  is  shown  in  Figure  1.3.  A  problem 
solving  style  is  supported  by  a  machine  independent  compiler.  This  compiler  produces 
a  ‘‘program”  of  basic  parallel  operations  from  Pi.  The  interface  is  supported  below 
by  a  combination  of  a  machine  dependent  compiler  and  machine  hardware. 

Pi  serves  as  an  image  of  what  lies  on  the  other  side.  It  must  accurately  represent  the 
requirements  and  constraints,  while  abstracting  away  unnecessary  details.  It  must 
be  expressive  enough  to  allow  efficient  interaction  across  the  interface,  but  not  overly 
constrain  the  design  freedom  of  either  side.  These  interface  issues  are  embodied  in 
the  following  design  criteria,  used  in  the  development  of  Pi: 

•  Interface  Abstractness:  Does  the  abstractness  of  the  interface  provide  a  recison- 
able  balance  of  flexibility  between  the  model  and  machine  implementation? 

•  Presentation  Accuracy:  Are  the  functionality  and  requirements  of  each  side  of 
the  interface  accurately  presented? 
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Figure  1.3:  Pi:  An  Architectural  Interface  Between  Model  and  Machine 

•  Functionality  Match:  Does  the  functionality  provided  in  the  interface  match 
(a)  the  requirements  of  model  mechanisms,  and  (b)  the  capabilities  of  machine 
i  mplemen  tati  ons  ? 

•  Implementation  Bias:  Do  interface  operations  minimize  bicis  towards  particular 
implementations  of  models  and  machines? 

These  criteria,  together  with  the  parallel  model  requirements,  formed  the  basis  of  the 
design  of  Pi. 

Thesis  Contributions 

•  Pi  Interface:  A  parallel  architecture  interface  based  on  generic  parallel  model 
requirements  is  defined. 

•  Model  Mechanisms  in  Pi:  Several  model  mechanisms  (a  shared  memory 
protocol,  set  synchronization,  object  name  translation,  and  non-resident  han¬ 
dlers)  and  two  parallel  applications  (n-body  and  relaxation)  are  implemented 
and  evaluated  on  a  parallel  architectural  interface. 

•  A  Pi  Substrate:  A  machine  architecture  that  efficiently  supports  generic  par¬ 
allel  operations  is  defined.  The  feasibility  of  one  implementation  is  demon¬ 
strated. 
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Related  Work 


Other  research  has  studied  parallel  architecture  interfaces  at  different  levels  of  ab¬ 
stractness.  Related  work  falls  into  three  classes:  language  interfaces,  machine  inter¬ 
faces,  and  theoretical  models. 

The  first  class  of  related  work  assumes  a  “language-centered”  view  of  the  problem 
(e.g.,  Paralation  LISP  [49],  dataflow  languages  [1],  AFL-1  [10],  and  Nial  [41]).  They 
provide  a  convenient  programming  environment,  often  supporting  multiple  execution 
model  mechanisms  (e.g.,  data  parallelism  and  MIMD  fork/joins).  However,  this  ap¬ 
proach  often  poses  a  large  “semantic  gap”  between  the  interface  and  the  machine 
implementation.  As  a  result,  an  efficient  implementation  of  the  language  interface  is 
often  difficult.  In  addition,  they  often  provide  no  separation  of  model  and  machine 
issues. 

The  second  class  of  related  work  assumes  a  “machine-centered”  view  of  the  problem 
(e.g.,  HEP  [34],  MPP  [7],  Cosmic  Cube  [51],  the  NON-VON  [52],  the  Transputer  [31], 
the  Connection  Machine  [56],  WARP  [4],  the  NCUBE  2  [43],  RP3  [46],  J-Machine 
[ISl.  the  BBN  Butterfly  [8],  and  the  BBN  Monarch  [48]).  In  this  class,  novel  machine 
organizations  and  new  technological  advances  are  combined  in  a  machine  design. 
The  machines  often  sport  impressive  raw  performance.  Unfortunately,  programming 
model  considerations  are  often  neglected,  resulting  in  a  large  semantic  gap  between  the 
machine  and  the  programming  environment.  This  often  restricts  the  use  of  these  high 
performance  machines  to  only  specialized  classes  of  applications  and  programming 
models. 

The  last  class  of  related  work  includes  theoretical  models  of  parallel  computation 
(e.g.,  Communicating  Sequential  Processors  [26],  PRAM  [50],  Actors  [3],  Type  Archi¬ 
tectures  [55],  Bulk-Synchronous  Parallel  Computers  [58]).  These  models  are  valuable 
for  studying  the  theoretical  capabilities  and  limits  of  parallel  computation.  However, 
they  usually  pursue  mathematical  clarity,  rather  than  being  sullied  with  the  practical 
details  of  solving  real-world  problems  with  real-world  machines.  They  typically  pro¬ 
vide  poor  architectural  interfaces.  In  the  worst  case,  they  produce  two  semantic  gaps 
(programming  language  model,  and  model  implemented  machine).  However, 
because  of  their  clarity,  theoretical  models  often  provide  inspiration  for  more  practical 
interfaces.  The  relationship  of  these  models  to  the  desired  interface  is  analogous  to 
the  relationship  of  the  Turing  machine  model  to  von  Neumann’s  interface. 

Thesis  Outline 

This  thesis  begins  with  the  definition  of  Pi,  presented  in  Chapter  2.  A  discussion  of 
interface  design  choices  is  presented. 

An  architectural  interface  cannot  be  examined  in  isolation.  This  thesis  also  stud¬ 
ies  how  model  mechanisms  are  built  on  Pi.  In  Chapter  3,  several  mechanisms  from 
parallel  models  are  constructed  including;  shared  memory  (with  caches),  set  synchro¬ 
nization,  object  name  translation,  non-resident  handler  support,  n-body  simulation, 
and  relaxation  simulation. 
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In  Chapter  4.  these  examples  are  evaluated  using  the  Pi  source  code,  and  execution 
statistics  from  a  Pi  simulation  environment. 

Since  Pi  is  to  be  supported  by  a  specially  designed  hardware  substrate,  an  example  of  a 
Pi  machine  architecture  is  presented  in  Chapter  5.  This  substrate  design  incorporates 
the  generic  parallel  model  requirements  described  earlier,  as  well  as  the  results  from 
the  evaluation  of  the  examples.  This  chapter  also  includes  a  discussion  of  gate  array 
implementation. 

Chapter  6  examines  how  the  requirements  of  Pi  are  supported  on  this  machine  ar¬ 
chitecture.  It  also  provides  some  cost  estimates  for  Pi  operations  supported  on  the 
described  hardware  architecture. 

Finally,  Chapter  7  summarizes  the  conclusions  of  the  thesis  and  describes  future 
directions  for  the  work. 
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Chapter  2 

The  Pi  Interface 


This  chapter  presents  Pi,  a  parallel  architecture  interface.  Pi  is  defined  in  the  form 
of  an  abstract  machine,  although  a  specific  machine  implementation  is  not  implied. 
Structurally,  the  Pi  abstract  machine  has  a  familiar  form;  a  collection  of  nodes  con¬ 
nected  by  a  communication  network.  However  its  behavior  embodies  a  broad  cleiss  of 
machine  mechanisms. 

The  Pi  abstract  machine  is  fine-grain  (i.e.,  thousands  of  nodes,  small  task  sizes).  It 
can  also  support  the  requirements  of  coarse-grain  models  and  machines  (i.e.,  tens 
of  nodes,  large  task  sizes).  Mixing  the  grain  size  of  models  and  machines  produces 
varying  results.  It  is  difficult  to  efficiently  execute  small  tasks  on  a  machine  designed 
for  large  tasks,  since  the  task  overhead  is  unacceptable  high.  However,  large  tasks  are 
efficiently  executed  on  a  small  task  machine  with  its  low  task  overhead.  Unfortunately, 
larger  task  models  have  less  potential  for  parallelism. 

The  Pi  abstract  machine  is  message  driven,  although  other  choices  are  possible  (e.g., 
communication  by  shared  memory,  task  invocation  by  remote  procedure  calls).  Mes¬ 
sage  pcissing  was  selected  because  it  provides  a  convenient  and  efficient  casting  of 
most  machine  mechanisms. 

This  chapter  examines  several  aspects  of  the  abstract  machine,  including  storage, 
synchronization,  communication,  task  management,  locality,  and  sequential  operation 
support. 


2.1  Storage 

In  Pi,  storage  is  represented  as  logical  collections  of  data  or  segments.  A  segment 
contains  a  number  of  related  pieces  of  information.  It  is  uniquely  named  on  the  node 
where  the  storage  resides.  A  segment  name  is  used  to  reference  the  stored  data. 
Each  location  in  a  segment,  referenced  by  an  offset,  contains  a  value  which  can  be  a 
number,  symbol  or  boolean.  Pi  does  not  specify  how  these  data  types  are  supported. 
However,  a  few  special  values  (e.g.,  UNBOUND)  are  referred  to  in  this  definition. 
Segments  are  distributed  across  all  nodes  of  the  abstract  machine. 


Each  segment  is  named  by  a  segment  ID.  Pi  does  not  specify  how  segment  IDs  are 
supported  in  an  implementation.  An  machine  implementation  can  use  the  physical 
address  of  the  segment  base  for  this  purpose.  If  this  technique  is  employed,  storage 
compaction  is  problematic  if  segment  IDs  are  copied  outside  the  node.  An  alter¬ 
nate  implementation  separates  segment  naming  and  storage  allocation  by  providing 
a  translation  between  names  and  absolute  pointers.  This  technique  does  not  require 
the  restriction  mentioned  above  for  storage  compaction.  However  a  translation  mech¬ 
anism  (albeit  a  simple  one)  is  necessary. 

ALhough  a  storage  hierarchy  is  not  part  of  the  abstract  machine,  it  is  not  precluded 
from  implementations  of  Pi.  Normally,  storage  access  costs  reflect  the  benefit  of 
reference  locality. 

There  are  many  alternatives  to  representing  storage  as  segments.  On  a  abstract  node, 
memory  could  be  addressed  as  a  single  linear  array.  However,  this  erodes  the  model 
machine  abstraction  and  restricts  the  implementation  of  memory  in  a  Pi  substrate. 
Memory  could  be  separated  from  the  processing  component  of  an  abstract  node  (i.e., 
separate  memory  and  processing  nodes).  This  approach  restricts  the  exploitation  of 
spatial  locality  above  the  interface.  It  also  couples  communication  and  storage  access 
operations,  preventing  explicit  control  of  communication  above  the  interface. 

Two  type  of  segments  are  supported  in  Pi:  read-write  and  associative 


2.1.1  Read- Write  Segments 

Read- Write  segments  are  finite  blocks  of  linearly  addressed  storage  locations.  They 
are  accessed  through  familiar  read  and  write  operations: 

(read  segment  offset) 

This  operation  returns  the  value  in  the  specified  segment  at  the  specified 
offset. 

(write  segment  offset  new-value) 

This  operation  writes  the  new  value  into  the  specified  segment  and  offset. 
Read-Write  segments  are  created  using  the  create-read-write-segment  operation: 
(create-read-write-segment  size) 

This  operation  allocates  a  read-write  segment  of  the  requested  size  (in 
words),  and  returns  a  segment  name. 

Segment  names  are  only  created  through  segment  allocation  operations.  Once  gen¬ 
erated,  they  are  immutable.  Address  arithmetic  is  only  possible  on  segment  offsets, 
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and  offsets  must  fall  within  the  segment  boundaries.  Once  a  read-write  segment  is 
created,  its  size  cannot  be  changed. 

Read- Write  segments  support  the  basic  storage  naming  mechanism  in  Pi:  the  tu¬ 
ple  <Node,  Segment,  OfFset>  or  NSO.  NSOs  are  referred  to  throughout  this 
document  as  handles  to  storage  locations.  More  general  naming  mechanisms  (which 
support  segment  migration  for  example)  can  be  supported  by  Pi,  but  are  not  explicitly 
part  of  the  interface. 


2.1.2  Associative  Segments 

Associative  segments  are  a  second  type  of  storage  supported  in  Pi.  They  are  provided 
in  addition  to  read- write  segments  because  they  constitute  a  different  access  model 
of  storage.  Associative  segments  could  be  supported  on  top  of  the  interface  using 
read-write  segments.  But  then  the  purpose  of  the  storage  would  be  obscured  or  lost, 
preventing  the  machine  builder  from  providing  the  best  implementation  of  associative 
segments  on  the  available  hardware.  Pi  attempts  to  capture  the  intent  of  the  model 
builder  without  specifying  all  the  details. 

.Associative  segments  provide  access  to  segment  data  via  the  following  operations: 

(insert  segment  key  new-value) 

This  operation  inserts  the  value  into  the  segment  with  the  specified  key. 

If  the  new  entry  cannot  be  inserted  into  the  segment  (because  of  size 
bounds),  the  key  of  the  existing  entry  (which  caused  the  collision)  is  re¬ 
turned.  Otherwise  the  key  of  the  inserted  entry  is  returned.  If  the  new 
value  is  specified  as  UNBOUND,  the  appropriate  key  is  returned,  but  the 
segment  is  unmodified. 

(match  segment  key) 

This  operation  returns  the  value  associated  with  the  key  in  the  segment. 

The  symbol  UNBOUND  is  returned  if  the  key  is  not  found  in  the  segment. 

(remove  segment  key) 

This  operation  removes  the  key  and  associated  value  from  the  segment. 

The  old  value  is  returned  if  the  entry  existed  in  the  segment.  Otherwise 
UNBOUND  is  returned. 

(clear  segment) 

This  operation  clears  all  entries  from  an  associative  segment. 
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Associative  segments  have  two  important  parameters  that  are  specified  when  the 
segment  is  created:  Size  and  Safety.  The  size  parameter  indicates  whether  the  segment 
size  is  fixed  or  not.  It  is  either  BOUNDED  (fixed  at  some  value),  or  UNBOUNDED^ 
(able  to  grow  to  some  relatively  large  size,  with  respect  to  the  system  storage).  The 
safety  parameter  indicates  whether  segment  entries  can  be  lost,  usually  as  a  result  of 
another  insertion.  An  associative  segment  is  either  SAFE  (entries  cannot  be  lost),  or 
UNSAFE  (entries  can  be  discarded  without  notice). 

These  parameters  are  orthogonal.  Here  are  the  behaviors  of  the  four  combinations: 
BOUNDED-SAFE 

This  segment  type  guarantees  that  the  segment  size  is  not  increased  and 
that  existing  entries  are  not  overwritten.  This  is  accomplished  only  by 
refusing  new  insertions.  When  an  insertion  collision  occurs,  the  insertion 
operation  returns  the  key  of  the  existing  entry.  This  key  can  be  used  to 
safely  remove  the  entry  before  retrying  the  insertion.  This  segment  type 
requires  the  implementation  to  use  a  repeatable  replacement  policy. 

UNBOUNDED-SAFE 

This  segment  type  guarantees  safe  acceptance  of  all  insertions,  by  increas¬ 
ing  the  size  of  the  segment.  An  insert  operation  is  normally  not  rejected, 
nor  is  an  existing  entry  overwritten. 

BOUNDED-UNSAFE 

This  segment  type  accepts  all  insertions  without  increasing  the  segment 
size.  It  may  unsafely  overwrite  an  existing  entry.  This  behavior  is  similar 
to  that  of  hardware  caches. 

UNBOUNDED-UNSAFE 

This  segment  type  behavior  is  underspecified,  since  an  insertion  collision 
can  be  resolved  either  by  increasing  the  segment  size,  or  by  overwriting  an 
existing  entry.  This  affords  the  machine  dependent  compiler  and  runtime 
system  flexibility  to  better  optimize  behavior. 

Pi  places  no  restrictions  on  how  entries  are  mapped  into  or  replaced  in  an  associative 
segment  beyond  those  described  above.  As  a  consequence,  a  segment  of  size  N  does 
not  consider  size  and  safety  issues  at  the  (N  -f-  l)th  insertion.  A  collision  can  occur 
as  early  as  the  second  insertion. 

Also,  UNBOUNDED  segments  only  guarantee  safe  acceptance  of  insertions  within 
the  bounds  of  specific  machine  limitations.  If  this  limit  is  exceeded,  an  exception 
results.  This  error  is  similar  in  nature  to  a  divide  by  zero  exception;  it  is  not  normally 
expected. 

An  associative  segment  size  is  specified  even  for  UNBOUNDED  types,  to  give  the 
machine  specific  compiler  an  estimate  of  the  expected  size. 

^An  UNBOUND  size  parameter  is  different  than  em  UNBOUND  value.  The  latter  means  the 
value  is  unknown  or  uninitialized. 
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Associative  segments  are  created  using  the  Create- Associative-Segment  operation, 
(create-associative-segment  size  type) 

This  operation  creates  an  cissociative  segment.  The  size  parameter  speci¬ 
fies  the  desired  initial  size  (number  of  key- value  bindings).  The  type  pa¬ 
rameter  specifies  the  boundedness  and  safety  of  the  segment.  The  segment 
type  is  one  of  the  following:  BOUNDED-SAFE,  UNBOUNDED-SAFE. 
BOUNDED-UNSAFE,  and  UNBOUNDED-UNSAFE. 

2.1.3  Storage  Reclamation 

Explicit  storage  deallocation  is  supported  in  Pi  via  the  Destroy- Segment  operation: 
(destroy-segment  segment) 

This  operation  frees  the  storage  in  the  specified  segment.  Any  segment 
type  can  be  freed  using  this  operation. 

Pi  neither  requires  nor  precludes  automatic  storage  reclamation  (garbage  collection). 
This  decision  is  left  to  the  machine  compiler.  However  certain  applications  expressed 
in  some  programming  models  perform  poorly,  or  not  at  all  without  automatic  storage 
reclamation. 

It  is  often  necessary  to  determine  the  size  of  a  read- write  or  associative  segment.  The 
Segment-Size  operation  supports  this  need. 

(segment-size  segment) 

This  operation  returns  the  size  of  a  read-write  or  associative  segment. 


2.1.4  Nodals 

A  segment  can  be  accessed  only  if  the  segment  name  is  known.  However  there  are 
many  applications  where  it  is  expedient  to  have  globally  “named”  segments  and 
values.  This  is  supported  via  node  variables  or  nodals.  They  are  analogous  to  global 
variables  in  sequential  computing,  except  a  copy  of  each  variable  exists  on  each  node. 
Nodals  are  globally  named  but  locally  consistent  (i.e.,  a  nodal’s  value  is  not  consistent 
across  nodes).  When  a  nodal  is  defined,  an  independent  copy  of  that  named  storage 
location  is  created  on  every  node.  They  are  maintained  in  a  special  nodal  segment 
on  each  node. 

(nodals) 

This  operation  returns  the  name  of  the  segment  containing  the  nodals 
for  the  current  node.  When  a  nodal  is  defined,  it  is  assigned  a  location 
in  the  nodal  segment.  Nodals  can  then  be  accessed  via  read-write  seg¬ 
ment  accessor  operations,  using  this  operation  to  access  the  node’s  nodal 
segment. 
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2.2  Synchronization 


Synchronization  in  parallel  system  is  composed  of  two  components:  a  piece  of  named 
synchronization  state,  and  an  inter-node  communication  mechanism  to  allow  access 
of  that  state.  In  Pi,  communication  is  provided  by  the  message  passing  operation 
which  is  presented  later  in  Section  2.3. 

The  named  state,  which  indicates  the  status  of  synchronization  events  (both  data 
and  control),  is  maintained  in  specially  annotated  storage  locations  (locations  are 
annotated  using  the  attribute  operation  described  in  Section  2.2.4).  In  this  way, 
synchronization  event  naming  is  supported  via  the  existing  storage  naming  facility 

(NSO). 

.A^s  an  alternative  to  this  approach,  the  naming,  storage,  and  communication  require¬ 
ments  of  synchronization  could  be  combined  in  a  single  synchronization  operation. 
However,  synchronization  communication  patterns  vary.  Sometimes  hundreds  of  tasks 
are  involved,  sometimes  only  two.  Synchronizing  tasks  may  be  located  on  different 
nodes,  or  they  may  be  on  one  node.  It  is  difficult  capture  these  different  communica¬ 
tion  patterns  in  a  general  operation.  By  separating  the  communication  component, 
it  can  be  handled  using  the  same  techniques  as  general  communications.  By  main¬ 
taining  synchronization  state  in  read-write  storage  locations,  synchronization  naming 
and  access  is  simplified. 

Several  forms  of  synchronization  are  explicitly  supported  in  Pi.  Three  forms  are 
considered  in  detail:  data,  barrier,  and  producer-consumer  synchronization. 

2.2.1  Data  Synchronization 

Data  svnchronization  is  supported  by  storage  locations  annotated  as  d-syncs.  This 
is  similar  to  other  data  synchronization  mechanisms  such  as  futures  in  Multilisp  [33], 
context  futures  in  CST  [19],  I-Structures  in  Id  [6],  and  full/empty  bits  in  HEP  [53]. 
When  a  data  synchronization  is  created,  it  is  marked  as  not  yet  containing  valid 
data.  Any  attempt  to  read  it  results  in  the  suspension  of  the  active  task.  A  d-sync 
maintains  a  ordered  set  of  suspended  tasks,  waiting  for  the  location  to  be  written. 
When  a  d-sync  is  written,  the  read  lock  is  cleared,  permitting  read  accesses  of  the 
location.  Suspended  tasks  waiting  to  read  the  location  are  requeued  for  execution 
in  the  order  they  suspended.  All  locations  in  newly  created  read-write  segments  are 
annotated  as  d-syncs  by  default. 

2.2.2  Barrier  Synchronization 

A  barrier  synchronization  or  b-sync  is  a  control  synchronization  mechanism,  similar 
to  counting  semaphores  described  by  Dijkstra  [21].  A  b-sync  maintains  a  count  which 
is  adjusted  (incremented  or  decremented)  and  tested.  When  a  b-sync  is  tested,  the 
active  t<isk  suspends  if  the  barrier  is  positive.  A  b-sync  maintains  a  ordered  set  of 
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tasks  which  have  suspended  on  the  barrier.  VV'hen  a  barrier  synchronization  count  is 
reduced  to  a  non-positive  value,  suspended  tasks  waiting  on  the  barrier  are  requeued. 
If  the  barrier  count  is  increased  to  a  positive  number,  tasks  will  once  again  suspend 
if  they  test  the  count. 

Two  operations  are  provided  to  support  barrier  synchronization:  test- count  and 
adjust-count. 

(test-count  segment  offset) 

This  operation  tests  the  count  of  a  barrier  synchronization.  If  it  is  positive, 
the  active  task  is  suspended  and  added  to  an  ordered  set  of  tasks  waiting 
on  the  barrier. 

(adjust-count  segment  offset  count-change) 

This  operation  adjusts  the  barrier  count  by  count-change,  count-change 
can  be  positive  or  negative,  allowing  the  required  count  to  be  increased  or 
decreased.  If  the  count  becomes  zero  or  negative,  suspended  tasks  waiting 
on  the  barrier  are  requeued  in  the  order  they  suspended. 


2.2.3  Producer-Consumer  Synchronization 

A  producer-consumer  synchronization  or  s-sync  is  a  combined  data  and  control  syn¬ 
chronization.  An  s-sync  guarantees  that  each  each  value  written  to  the  synchro¬ 
nization  location  is  read  exactly  once.  It  is  useful  for  the  support  of  several  model 
mechanisms  including  lock  step  execution  of  a  set  of  tasks,  often  required  in  data 
parallelism. 

An  s-sync  initially  behaves  like  a  d-sync,  prohibiting  reads  until  the  location  has  been 
written.  After  it  has  been  written,  write  attempts  are  suspended  until  the  s-sync 
has  been  read.  Then  the  cycle  repeats.  This  behavior  guarantees  that  a  value  is  not 
overwritten  before  it  is  read,  or  read  twice. 


2.2.4  Attributes 

All  synchronizations  are  supported  in  read- write  storage  locations.  Storage  locations 
are  designated  as  synchronizations  by  attributes.  Each  word  of  a  read-write  segment 
is  annotated  with  an  attribute  which  governs  the  permissible  access  patterns  for  that 
location.  Read-Write  locations  are  initially  attributed  as  d-syncs.  This  attribute 
can  be  changed  using  the  attribute  operation  to  one  of  the  seven  attribute  tyi>es 
summarized  in  Figure  2.1. 

(attribute  segment  offset  attribute-type) 
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Figure  2.1:  Synchronization  Attribute  Summary 
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This  operation  attributes  the  specified  segment  location.  The  following 
attribute  types  are  supported:  D-SYNC,  WRITE-ONCE.  S-SYNC,  B- 
SYNC,  READ-ONLY,  WRITE-ONLY,  and  READ-WRITE.  When  a  lo¬ 
cation  is  attributed,  all  suspended  tasks  on  the  location  are  requeued  in 
the  order  they  suspended. 

It  is  often  desirable  to  check  w'hether  a  local  memory  access  operation  is  going  to 
result  in  a  "uspension  without  actually  executing  it.  The  probe  operation  is  provided 
for  this  purpose. 

(probe  segment  offset  access-operation) 

This  operation  is  a  predicate,  indicating  whether  the  specified  access  op¬ 
eration  will  result  in  a  suspension  of  the  active  task.  Access  operations 
include:  read,  write,  and  test-count. 

Global  time  is  valuable  for  supporting  certain  model  mechanisms  including  temporal 
synchronization.  Global  time  is  provided  by  the  time  operation. 

(time) 

This  operation  returns  the  global  time  of  the  system. 

2.3  Communication 

All  communication  requirements  of  Pi  are  satisfied  by  a  single  mechanism:  message 
passing.  A  messages  passing  cycle  is  composed  of  the  following  steps: 

1.  A  message  is  injected  into  a  network  by  a  task  on  the  source  node.  The  message 
destination,  length,  priority,  and  type  are  sent  (in  that  order),  followed  by  a 
variable  number  of  arguments. 

2.  The  communication  network  routes  the  messages  to  the  correct  destination. 
The  destination  field  of  the  message  is  removed  during  delivery. 

.3.  The  destination  node  receives  the  first  field  of  the  message,  the  message  length. 
It  then  allocates  storage  (a  read-write  segment)in  local  node  memory  for  the 
messages. 

4.  The  destination  then  accepts  the  remainder  of  the  message  from  from  the  net¬ 
work  and  stores  it  into  the  allocated  segment. 

5.  The  message  is  then  scheduled  as  a  task  using  the  task  ID  (the  ID  of  the 
segment  containing  the  message)  in  a  task/message  queue.  The  specific  queue 
is  specified  by  the  priority  field.  Teisk  scheduling  and  invocation  is  discussed 
further  in  Section  2.4. 


There  are  many  alternatives  to  this  approach  for  supporting  communication.  For 
example,  communication  and  task  invocation  can  be  separated  into  two  operations. 
The  local  task  invocation  operation  is  similar  to  Pi’s  call  operation,  described  in 
Section  2.4.5.  For  the  communication-only  operation,  naming  of  the  communicated 
data  becomes  problematic.  A  complex  scheme  is  required  to  allow  a  task  on  the 
destination  node  to  connect  with  the  appropriate  received  data.  All  of  the  approaches 
(e.g.,  scanning  a  message  arrival  queue,  supporting  a  shared  address  space,  etc.) 
require  a  substantial  mechanism  overhead.  By  combining  communication  and  remote 
task  invocation,  the  invoked  task  knows  the  name  of  the  communicated  data  since 
it  is  included  in  the  task  segment.  If  communication-only  messages  are  required,  a 
simple  task  which  stores  the  data  in  the  appropriate  location  can  be  employed. 

.\nother  alternative  is  to  support  more  complicated  communication  operations  like 
remote  reads  and  writes.  However,  this  complexity  is  not  always  required,  resulting 
in  wasted  communication  resources.  Also,  this  mechanism  can  be  easily  synthesized 
using  message  passing.  Most  often,  a  simple  but  general  operation  is  preferable  to  a 
more  complicated  one. 

The  simplest  operation  does  not  always  provide  acceptable  performance.  For  ex¬ 
ample.  another  communication  alternative  is  an  operation  that  transmits  one  word 
of  data  to  an  adjacent  neighbor.  Since  each  “hop"  in  the  communication  network 
requires  node  intervention,  this  creates  unacceptable  communication  latency,  and  un¬ 
necessarily  burdens  the  nodes  in  the  routing  path.  Additionally,  the  arbitrary  source 
and  destination  message  passing  operation  in  Pi  provides  a  more  abstract  view  of  the 
communication  network,  allowing  greater  implementation  freedom  for  the  machine 
builder. 

single  operation  supports  message  sending  in  Pi: 

(send  destination  length  priority  type  arg-1  arg-2  arg-3  .  .  .  arg-n) 

This  operation  constructs  a  message  containing  the  send  arguments  and 
enqueues  it  for  delivery  to  the  named  destination  node.  During  trans¬ 
mission.  destination  is  removed  from  the  message  by  the  network.  At  the 
destination  node,  storage  (a  segment)  is  allocated  for  the  incoming  mes¬ 
sage,  using  length.  The  message  fields  type  and  priority  specify  how,  and 
in  what  order  the  message  is  handled  by  the  receiving  node.  Message 
handling  is  described  in  a  later  section. 

When  a  message  arrives  at  a  node,  it  is  enqueued  as  a  segment  in  the  node's  storage. 
In  the  Pi  abstract  machine,  this  happens  independently  of  the  node’s  current  activities 
(i.e.,  a  task  executing  on  the  node  is  not  suspended  when  a  new  message  is  received). 


2.3.1  The  Network 

The  communication  network  is  capable  of  delivering  messages  between  a  sender  and 
receiver  node.  The  details  of  the  delivery  (routing,  in-transit  buffering,  etc.)  are  not 
visible  in  the  interface.  However,  the  network  exhibits  the  following  properties: 


24 


•  Message  delivery  is  guaranteed. 

•  Message  latency  is  indeterminate. 

•  Message  delivery  order  between  two  nodes  is  not  preserved. 

•  The  network  can  exert  backpressure^  on  message  sending  nodes. 

2.4  Task  Management 

Communication  and  task  management  are  closely  related  in  Pi.  Message  sending  is 
one  of  two  mechanisms  for  creating  tasks  (the  other  is  task  calling).  All  node  activity 
stems  directly  from  message  reception.  This  section  considers  several  aspects  of  task 
management  in  the  Pi  abstract  machine. 

2.4.1  Task  Storage 

Normally,  all  task  storage  is  allocated  at  message  arrival  time.  When  a  message  is 
received,  the  task  storage  requirement  is  computed  as  the  sum  of  the  message  length 
field  and  the  task  storage  overhead  size  (i.e.,  space  for  processor  state,  etc.).  A 
segment  of  this  size  is  allocated,  and  the  message  is  stored  there. 

In  some  models,  the  required  task  size  cannot  be  determined  statically.  In  these 
dynamic  cases,  an  additional  segment  is  allocated  at  runtime  to  provide  the  required 
storage. 

Task  naming,  like  synchronizations,  is  supported  by  segment  names  (NSO).  Therefore, 
the  allocation  of  task  storage  includes  the  assignment  of  a  unique  name  to  the  task: 
the  segment  name.  This  name  is  obtained  via  the  self  operation. 

(self) 

This  operation  returns  the  segment  name,  which  is  used  as  part  of  a  NSO 
address. 

2.4.2  Task  Dispatch 

Both  mechanisms  for  task  invocation,  send  and  call,  use  a  type  field  to  specify  the 
appropriate  handler.  Depending  on  the  machine  implementation,  this  field  can  con¬ 
tain  one  of  two  value  types:  a  physical  address  of  a  code  segment  or  the  segment  ID 
of  the  code  segment  (if  this  is  implemented  differently).  This  distinction  (address  or 
ID)  can  be  made  at  machine  compile  time  or  run  time  (using  type  tagging).  In  either 
case,  the  code  segment  must  be  resident  on  the  node. 

-This  means  that  the  network  can  halt  nodes  that  are  injecting  messages. 


25 


To  support  handlers  which  are  not  guaranteed  to  be  resident  (i.e.,  non-resident  han¬ 
dlers),  a  separate  argument  in  the  message  or  called  task  specifies  a  name  which  can 
be  resolved  to  a  segment  ID  using  a  user  defined  translation  mechanism.  After  this 
translation  is  accomplished  (and  the  non-resident  handler  is  copied  locally),  call  is 
invoked  to  begin  handler  execution.  An  example  of  non-resident  handler  support  is 
described  in  Chapter  3. 


2.4.3  Task  Atomicity 

Task  execution  is  non-preemptive.  A  task  executes  until  termination,  or  until  a 
synchronization  point  forces  a  suspension.  Non-busy  wait  techniques  (explicit  wake- 
up)  are  employed  to  resume  suspended  tasks. 

-A  task  can  voluntarily  suspend  itself,  perhaps  allowing  higher  priority  tasks  (defined 
in  the  next  section)  to  execute. 

(suspend) 

This  operation  suspends  execution  of  the  current  task.  It  is  immediately 
requeued  for  execution. 

A  task  is  terminated  when  the  associated  segment  is  destroyed. 


2.4.4  Task  Prioritization 

Task  execution  ordering  is  supported  via  task  queues.  FIFO  ordering  in  these  queues 
is  required,  since  non-FIFO  ordering  can  result  in  livelock  and  deadlock  in  certain 
circumstances,  and  excessive  task  buffering.  This  guarantees  that  tasks  which  suspend 
on  a  synchronization  are  requeued  in  the  order  they  suspended.  If  non-FIFO  buffering 
is  employed  (for  example  FILO),  two  taisks  can  enter  a  livelock  condition  arbitrating 
for  a  synchronization^. 

Since  the  Pi  abstract  machine  is  fine-grained,  each  node  normally  has  many  tasks 
waiting  to  execute.  In  this  circumstance,  simple  arrival  ordering  of  tasks  is  inadequate, 
since  the  waiting  time  for  critical  tasks  becomes  long.  Therefore,  multiple  task  queues 
are  used  to  allow  more  critical  tasks  to  begin  ahead  of  other  earlier  tasks. 

When  a  message  is  received,  the  priority  field  determines  where  it  should  be  enqueued. 
Priorities  can  be  positive  or  negative  integers,  where  zero  is  “normal”  by  convention. 
These  queues  determine  an  execution  ordering  where  no  task  in  queue  N  is  started 
until  all  queues  greater  than  N  are  empty.  However,  once  a  task  begins,  it  executes  to 
suspension  or  completion  regardless  of  the  priority  of  any  received  messages.  There 
are  no  task  interrupts. 

^This  condition  was  actually  produced  in  a  demonstration  example  used  in  Chapter  3.  An  early 
version  of  a  simulator  was  erroneously  coded  with  stack  task  queues.  A  livelock  condition  resulted. 
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Multiple  task  queues  are  recommended  but  not  required  of  machine  implementa¬ 
tions.  Machine  designers  are  free  to  support  a  single  queue,  using  resources  for  other 
features.  Because  of  uncertainty  in  message  delivery  time,  models  should  employ 
multiple  queues  for  performance  improvements  rather  than  for  cissuring  functional 
correctness. 

2.4.5  Calling  Tasks 

Call  is  the  second  Pi  mechanism  for  creating  tasks.  It  is  syntactically  similar  to  a 
sequential  procedure  call.  The  new  task  is  executed  on  the  same  node.  Also,  the 
caller  task  is  suspended  in  a  special  way,  so  that  no  other  node  tasks  can  intervene 
when  the  new  task  is  called  or  aftci  it  completes.  Calling  a  task  behaves  differently 
than  sending  a  message  to  the  current  node.  With  a  message  send,  other  messages 
can  execute  before  or  after  the  new  task,  disrupting  the  caller  task’s  atomicity. 

Call  guarantees  atomic  execution  of  a  code  sequence,  while  allowing  code  sharing. 
Without  it.  code  sequences  would  need  to  be  macro-expanded  at  each  call.  With 
calls,  the  machine  compiler  ran  choose  to  macro-expand  calls  independently,  while 
supporting  Pi’s  semantics. 

(call  length  type  arg-1  arg-2  arg-3  .  .  .  arg-n) 

This  operation  invokes  a  handler  on  the  current  node.  The  length  field 
specifies  the  required  storage  size  in  the  task  segment.  Type  specifies 
the  handler.  The  task  is  executed  immediately,  with  no  opportunity  for 
waiting  tasks  to  run.  When  the  called  task  completes,  the  caller  task  is 
immediately  reinvoked. 

Since  called  tasks  are  executed  atomically  within  the  caller,  a  called  task  can  directly 
return  a  result  to  the  caller  t^Lsk,  rather  than  the  more  complicated  alternative  of 
sending  a  reply  message.  If  a  return  operation  is  executed  at  any  point  during  the 
called  task  the  result  value  is  returned  from  the  Call  operation.  The  returned  result 
need  not  be  read  by  a  caller  task.  Additionally,  non-called  tasks  (resulting  from 
jnessage  passing)  can  execute  the  return  operation  with  no  effect. 

(return  return-value) 

This  operation  stores  return-value  away,  to  be  returned  later  when  the 
called  handler  completes.  If  a  return  operation  is  executed  more  than 
once,  the  last  value  is  returned. 


2.4.6  Variable  Argument  Passing 

Pi  supports  handlers  which  accept  a  variable  number  of  arguments.  However,  the  send 
and  call  operations  pass  a  fixed  number  of  parameters.  A  variant  of  these  operations 
exists  for  variable  argument  passing. 
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(send-segment  length  type  segment) 

This  operation  is  similar  to  send,  except  the  arguments  are  taken  from  the 
specified  segment. 

(call-segment  length  type  segment) 

This  operatioii  is  similar  to  call,  except  the  arguments  are  taken  from  the 
specified  segment. 


2.5  Locality 

Many  types  of  locality  can  be  exploited  in  a  parallel  problem  solving  system.  The  only 
concern  here  is  identifying  what  facilities  are  required  in  this  architectural  interface 
to  support  locality.  For  example,  as  stated  earlier.  Pi  neither  requires  nor  precludes 
exploitation  of  storage  reference  locality.  It  is  handled  below  the  interface  and  is  not 
part  of  Pi. 

Spatial  locality  in  node  selection  requires  support  in  Pi  since  the  communication 
topology  is  abstracted  below  the  interface.  Yet  information  about  spatial  locality  is 
often  available  above  the  interface. 

Several  alternatives  for  specifying  spatial  locality  were  considered.  However,  no 
scheme  allowed  general  spatial  specification  without  placing  constraints  on  the  un¬ 
derlying  substrate  topology. 

Pi  provides  three  operations  to  support  node  spatial  locality: 

(node-id) 

This  operation  returns  the  node  ID  of  the  current  node.  This  supports 
the  specification  of  NSO  handles,  and  locality  designation  for  tasks  which 
should  be  executed  on  the  current  node. 

(another-node-id  where) 

This  operation  returns  a  node  ID  of  a  node  in  the  system.  The  locality  of 
the  node  is  determined  by  the  where  parameter  which  is  one  of  the  follow¬ 
ing:  NEAR,  FAR,  or  ANY.  NEAR  indicates  that  requested  ID  should  be 
near  to  the  current  node.  FAR  indicates  that  a  distant  node  is  required. 

ANY  indicates  that  any  node  in  the  system  (including  the  current  node) 
is  appropriate.  ANY  assumes  a  random  node  ID  selection.  A  Pi  imple¬ 
mentation  may  use  this  information  to  balance  storage  or  compute  usage, 
but  this  is  not  required. 

(distance  node-x  node-y) 

This  operation  computes  the  distance  between  node-x  and  node-y.  The 
units  of  the  measurement  are  unspecified,  but  they  are  proportional  to 
communication  latency  between  the  nodes.  This  operation  is  most  valu¬ 
able  for  making  relative  comparisons  of  locality  (e.g.,  what  node  is  closest). 
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2.6  Sequential  Operations 


Beyond  the  operations  already  described.  Pi  provides  a  base  of  sequential  operations. 
Even  the  most  ambitious  parallel  models  are  beginning  to  recognize  this  need  [5,  23, 
44l.  The  following  operations  are  included; 


(plus  X  y)  (minus  x 

(not  x)  (or  X  y) 

(compare  x  y)  (rotate  x 

(branch-zero  test-variable  label) 
(branch-plus  test-variable  label) 
(branch-minus  test-variable  label) 


y) 

y) 


(times  X  y)  (divide  x  y) 

(and  X  y)  (xor  x  y) 

(arithmetic-shift  x  y)  (logical-shift  x  y) 
(branch-not-zero  test-variable  label) 
(branch-not-plus  test-variable  label) 
(branch-not-minus  test-variable  label) 


Most  of  these  operations  have  obvious  meanings.  The  compare  operation  compares 
two  values  (say  and  E),  and  returns  1  if  X  <  0  if  X  =  and  -1  if  X  >  Y. 

Using  compare  and  the  branch  operations,  all  branch  types  can  be  formed. 

Sequential  operations  pose  a  dilemma  in  Pi.  The  design  of  Pi  has  avoided  including 
things  in  the  specification  which  are  irrelevant  to  its  purpose  as  a  parallel  architectural 
interface.  Therefore,  a  complete  set  of  sequential  operations  is  not  specified  in  Pi. 
The  definition  of  such  a  set  is  outside  the  scope  of  this  thesis. 

However  it  is  impractical  to  construct  a  complete  sequential  operation  set  using  the 
operations  specified  above.  For  example,  if  hardware  which  computes  square  root  is 
present  in  hardware,  one  would  not  want  to  have  a  software  square  root  routine  built 
from  the  operations  above. 

Therefore,  the  set  of  sequential  operations  included  in  Pi  is  intentionally  left  open. 
Future  work  in  the  design  of  a  sequential  architectural  interface  will  complete  this 
aspect  of  Pi. 


2.7  Summary 

This  chapter  has  defined  Pi,  a  parallel  architecture  interface.  Pi  is  based  on  a  set  of 
generic  parallel  model  requirements.  Pi  provides: 

•  operations  to  support  both  linearly  indexed  and  associatively  addressed  storage 

•  several  forms  of  data  and  control  synchronizations 

•  a  communication  operation 

•  support  for  task  management 

•  support  for  topology-independent  specification  of  locality 
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operation  class 

Pi  operations 

storage  (read- write) 

read,  write 

storage  (associative) 

insert,  match,  remove,  clear 

storage  management 

create-read-write-segment,  create-associative-segment, 
destroy-segment,  segment-size 

storage  (node  variables) 

nodals 

synchronization 

attribute,  probe,  time,  test-count,  adjust-count 

communication 

send,  send-segment 

task  management 

seif,  suspend,  call,  call-segment,  return 

locality 

node-id,  another-node-id,  distance 

sequential  operations 

plus,  minus,  times,  divide,  not,  or,  and,  xor, 
compare,  rotate,  arithmetic-shift,  logical-shift, 
branch-zero,  branch-not-zero,  branch-plus, 
branch-not-plus,  branch-minus,  branch-not-minus 

Table  2.1:  Pi  Operation  Summary 
•  sequential  operations 
Table  2.1  summarizes  the  Pi  operations. 

Pi  hcLS  been  guided  by  evaluation  criteria  described  in  Chapter  1.  Using  these  criteria, 
it  is  hoped  that  the  interface  is  appropriately  abstract,  accurate,  well-matched,  and 
unbiased  about  model  and  machine  implementation.  Yet  this  can  only  be  determined 
by  studying  the  interface  in  use.  The  remainder  of  this  thesis  examines  the  effective¬ 
ness  of  this  interface  in  supporting  model  mechanisms,  and  in  being  supported  by  a 
machine  substrate. 
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Chapter  3 

Building  Models  on  Pi 


A  programming  model  is  a  collection  of  problem  solving  tools.  The  tools  are  abstract, 
and  they  are  usually  not  provided  directly  in  Pi.  Instead,  this  abstract  functionality 
is  embodied  in  model  mechanisms  constructed  from  interface  operations.  A  set  of 
model  mechanisms  must  be  constructed  for  each  model  supported  in  Pi. 

In  this  chapter,  Pi’s  ability  to  support  several  model  mechanisms  is  examined.  A 
broad  range  of  examples  is  presented  to  demonstrate  various  aspects  of  mechanism 
construction  including  task  management,  naming,  synchronization,  storage  manage¬ 
ment,  communication,  locality,  and  sequential  sequence  support.  This  chapter  exam¬ 
ines  four  specific  mechanism  examples  and  two  simple  applications,  including: 

•  shared  memory  (with  caches) 

•  set  synchronization 

•  object  name  translation 

•  non-resident  handler  support 

•  n-body  simulation 

•  relaxation  simulation 

Since  Pi  is  an  architectural  interface,  not  a  programming  language.  Pi  programs  are 
very  low  level,  with  many  details  exposed.  Programming  in  Pi  is  analogous  to  pro¬ 
gramming  in  a  machine-independent  parallel  assembly  language.  Task  management, 
naming,  synchronization,  storage  management,  communication,  and  locality  issues 
are  all  explicitly  defined  in  the  programs. 

High  level  language  programming  environments  are  necessary  to  develop  application 
programs  on  a  Pi  machine.  They  provide  a  convenient  specification  environment 
with  model  mechanisms  presented  cleanly  and  abstractly.  High  level  languages  and 
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handler  name 


parameters 


local  variables  handler  code 


(define-handler  Fact  (N)  (X) 

(print-user  ''~&factorial  (~d)  =  "  (read  (self)  N)  ) 
(write  (self)  X  1) 

(branch-minus  (compare  (read  (self)  N)  2)  Done) 

Loop 

(write  (self)  X  (times  (read  (self)  X)  (read  (self) 
(write  (self)  N  (minus  (read  (self)  N)  1)) 
(branch-not-minus  (compare  (read  (self)  N)  2)  Loop) 
Done 

(print-user  "~d~4”  (read  (self)  X)) 

(destroy-segment  (self) ) ) 


N)  )  ) 


Figure  3.1;  Factorial  Handler 


compilers  are  absent  from  this  thesis  because  (a)  their  specification  and  construction 
is  an  open  research  issue  in  parallel  computing,  and  (b)  the  goal  of  the  thesis  is  to 
show  that  the  model  mechanisms  which  underlie  high  level  languages  can  be  efficiently 
supported  by  an  architectural  interface. 

The  model  examples  have  been  implemented,  tested,  and  metered  using  a  Pi  simu¬ 
lator  described  in  the  next  section.  This  chapter  describes  their  implementation.  In 
Chapter  4,  the  results  of  these  executions  are  presented  and  evaluated.  The  code  for 
all  examples  is  provided  in  Appendix  A. 


3.1  PiSim:  A  Pi  Simulator 

It  is  difficult  to  construct  mechanisms  in  Pi  without  an  execution  environment.  For 
this  reason,  a  Pi  simulator,  PiSim,  was  created.  PiSim  supports  the  operations  defined 
in  Chapter  2.  It  provides  a  way  to  define,  debug,  and  meter  Pi  programs.  PiSim 
includes  a  syntax  for  expressing  Pi  programs.  This  section  presents  a  short  description 
of  the  PiSim  environment. 


3.1.1  Handlers 

Pi  programs  are  collections  of  message  handlers.  Handlers  in  Pi  are  analogous  to 
procedures  in  a  sequential  program.  A  handler  consists  of  a  sequence  of  operations 
which  are  sequentially  executed  on  a  node  in  response  to  a  message.  A  handler  also 
specifies  the  state  (e.g.,  parameters  and  locals)  used  during  the  execution. 

PiSim  cissumes  that  the  code  for  each  handler  is  present  on  all  nodes  (support  for 
non-resident  handlers  is  presented  in  an  example  later  in  this  chapter).  A  simple 
handler  for  computing  factorial  is  shown  in  Figure  3.1. 
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In  a  compiler  generated  Pi  program,  parameters  and  local  variables  are  not  mnemon- 
icaily  named  as  in  this  example.  A  handler  specification  only  includes  the  number 
of  parameters  and  required  locals.  These  variables  are  named  in  PiSim  to  improve 
program  readability.  When  a  variable  name  is  used  in  a  handler,  it  is  translated  by 
PiSim  to  the  appropriate  offset  into  a  segment.  When  a  handler  accepts  a  variable 
number  of  parameters,  its  parameters  are  referred  to  by  offsets  directly  rather  than 
names. 

Sending  a  message  is  similar  to  calling  a  procedure.  Handlers  are  invoked  by  send  and 
call  operations  in  other  handlers.  When  a  message  is  received  on  a  node,  the  message 
type  is  used  to  select  the  appropriate  handler  to  service  the  message.  That  handler 
is  then  invoked  on  the  message.  Figure  3.2  illustrates  a  factorial  handler  invocation. 

PiSim  also  provides  an  operation  to  inject  a  start  message  into  the  system; 

(inject  'handler-name  arg-1  arg-2  arg-3  .  .  .  arg-n) 

This  PiSim  operation  sends  the  message  to  a  random  node  at  time  zero. 

Since  this  operation  computes  the  message  length,  it  cannot  be  used  to 
invoke  handlers  which  accept  a  variable  number  of  arguments.  When  a 
message  is  injected  into  the  system,  all  simulated  state  is  first  initialized. 

This  prevents  previous  state  from  affecting  the  execution  of  a  program, 
thus  improving  repeatability  for  debugging. 

PiSim  is  implemented  using  a  single  priority  message/task  queue  on  each  node.  As 
explained  in  Chapter  2,  this  is  a  valid  implementation  of  the  interface.  Multiple 
message  priorities  have  not  been  used  in  the  examples. 


3.1.2  Nodals  and  Constants 

.An  example  of  a  nodal  and  constant  definition  is  shown  in  Figure  3.3.  Constants 
are  often  used  in  the  examples  to  define  a  data  structure.  The  constants  define  slot 
offsets  in  a  segment. 

Variables  in  PiSim  are  automatically  typed.  Automatic  typing  is  not  required  in  Pi, 
although  PiSim  provides  no  way  of  declaring  a  variable’s  type. 


3.1.3  Special  Operations 

The  Pi  examples  presented  in  this  chapter  contain  a  few  additional  operations  which 
are  not  included  in  the  Pi  definition  presented  in  Chapter  2.  These  operations  are 
not  formally  part  of  Pi,  but  they  are  useful  for  presentation  clarity. 

(print-user  format  arg-1  arg-2  arg-3  .  .  .  arg-n) 

This  operation  prints  the  arguments  using  the  LISP  format  string. 
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Figure  3.2;  Handler  Invocation 


nodal  name 


constant  name 


constant  value 


; ;  This  nodci^>v^^ctors  the  ^ap  of  chunk  obj^^s  on  a  node, 
(define-nodal  Chunk-Map) 

; ;  This  constant  defines  ^he  size  o^the  cache  map. 

(def ine-constant  Cache-Map-Size  16) 

Figure  3.3;  Nodal  and  Constant  Definition 


(set-debug-level  level) 

This  operation  sets  the  debug  level,  which  controls  the  amount  of  debug¬ 
ging  information  provided  by  PiSim. 

These  sequential  operations  are  used  in  the  examples.  Since  PiSim  supports  them 
with  the  Common  LISP  equivalent,  a  detailed  definition  of  their  operation  can  found 
in  [32]. 


(random  x)  (exponent  x  y)  (mod  x  y)  (ceiling  x  y) 


3.1.4  PiSim  Implementation 

This  PiSim  implementation  also  includes  several  program  evaluation  and  metering 
functions  including  operation  profiling,  task  and  storage  statistics,  and  concurrency 
information.  These  facilities  are  used  to  evaluate  these  mechanism  examples  in  Chap¬ 
ter  4. 

PiSim  is  implemented  in  Common  Lisp  using  Sun  Common  Lisp  window  and  graphics 
extensions.  It  is  expressed  in  roughly  2000  lines  of  code.  The  examples  in  this  chapter 
execute  onaSun4/60ina  range  of  minutes  to  several  hours. 


3.2  Shared  Memory  with  Caches 

This  section  describes  an  approach  for  supporting  a  shared  memory  model  with 
caches,  using  Pi.  This  scheme  has  not  been  evaluated  using  application  trace  data. 
Its  purpose  is  to  exemplify  the  requirements  of  medium-grain  to  fine-grain  shared 
memory  systems*  (e.g.,  [2,  9,  11,  12,  22,  35,  36]).  This  example  demonstrates  efficient 
shared  memory  mechanisms  constructed  with  Pi. 

'This  task  is  complicated  by  the  lack  of  message  order  preservation  in  Pi. 
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3.2.1  A  Shared  Address  Space 


The  shared  memory  model  requires  the  support  of  a  logically  contiguous  address 
space  built  out  of  distributed,  discontinuous  memory.  The  basic  unit  of  the  model  is 
a  line,  a  small  contiguous  piece  of  memory  that  is  managed  as  an  atomic  element  of 
storage.  Each  line  is  mapped  into  r  unique  place  in  the  shared  address  name  space. 
The  number  of  words  contained  in  a  line  is  normally  a  power  of  two. 

Since  shared  memory  it  too  large  to  fit  on  a  single  node,  it  is  divided  into  pieces  which 
are  distributed  across  all  the  nodes  in  the  system.  Each  piece  is  called  a  chunir.  In  this 
implementation,  storage  for  the  entire  shared  memory  is  allocated  when  the  system 
is  initialized,  rather  than  being  allocated  on  an  as-required  basis.  Chunk  storage  is 
composed  of  a  set  of  line  sized  segments.  A  chunk  map  is  constructed  on  each  node 
to  maintain  the  locations  of  line  segments  allocated  there.  Chunks  are  composed  as 
collections  of  small  objects  (lines)  rather  than  a  single  large  array  to  simplify  accesses 
using  the  segment  paradigm  in  Pi. 

.\ccessing  a  shared  memory  address  requires  locating  the  correct  line  on  the  correct 
node.  There  are  a  variety  of  ways  to  map  node  and  line  identification  into  the  shared 
address.  In  this  demonstration  example,  a  simple  scheme  has  been  chosen. 

.-\  shared  memory  address  is  composed  of  the  following  fields: 

MSB  LSB 

node  ID  index  offset 

Starting  from  the  least  significant  bit,  the  first  field  is  the  line  offset.  This  identifies 
a  word  in  the  line.  This  is  followed  by  the  index  field  which  specifies  a  line  in  the 
chunk  map  of  a  node.  Finally,  the  node  ID  field  indicates  the  location  of  the  shared 
address  storage. 

The  node  ID  field  is  actually  the  node  number  of  an  abstract  machine  node:  there 
is  no  indirection.  This  precludes  the  relocation  of  a  specific  shared  memory  loca¬ 
tion.  Supporting  shared  memory  relocation  requires  a  distributed  name  translation 
mechanism.  However  it  is  not  clear  that  this  capability  justifies  the  added  complexity. 
Caches  themselves  provide  a  form  of  relocation.  The  functionality  may  be  appropriate 
to  support  virtual  shared  memory  addresses. 

.\ddress  braiding  is  often  used  in  shared  memory  systems.  This  is  the  technique 
of  dividing  and  intermixing  the  node  and  index  fields  to  scatter  sequential  address 
sequences  (e.g.,  large  objects)  across  several  nodes.  It  is  not  employed  in  this  example, 
although  it  has  been  implemented  in  a  simpler,  non-caching  scheme.  The  requirements 
and  overhead  of  supporting  braiding  are  evaluated  later.  Address  braiding  is  omitted 
in  this  scheme  for  clarity. 

Caches  take  advantage  of  locality  to  avoid  remote  accesses  on  each  shared  memory 
reference.  When  a  word  of  shared  memory  is  accessed,  the  entire  line  in  which  it 

^Other  names  (e  g.,  page,  segment,  block)  were  considered  and  dismissed  because  they  implied 
unintended  connotations  about  the  implementation. 
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resides  is  obtained  and  stored  in  the  cache  to  exploit  spatial  locality.  Words  are 
retained  in  the  cache  for  future  accesses  to  exploit  temporal  locality. 

Like  chunk  storage,  caches  are  constructed  using  a  set  of  lines.  However  the  cache 
map  (analogous  to  the  chunk  map)  is  realized  using  associative  storage.  In  order 
to  maintain  cache  consistency,  cached  copies  of  a  shared  memory  line  are  chained 
together  forming  a  list.  The  permanent  (chunk)  line  maintains  a  pointer  to  the  head 
of  this  list,  so  that  the  copies  can  be  invalidated  when  necessary. 

The  details  of  lines  and  maps  are  presented  in  the  next  section. 


3.2.2  The  Components 

Each  node  in  the  system  contains  a  local  cache  and  a  chunk  of  the  shared  memory. 
The  cache  and  chunk  are  constructed  using  three  segment  types:  lines,  chunk  maps, 
and  cache  maps. 

.-\  line  is  the  basic  storage  element  in  this  shared  memory  implementation.  It  is 
realized  as  a  read- write  segment  containing  twelve  words.  The  segment  is  composed 
of  the  following  fields: 

Store:  This  field  contains  the  eight  words  of  data  maintained  in  this  line. 

Link:  For  chunk  lines,  this  field  indicates  the  node  containing  the  firsu  cache  copy 
of  this  line.  For  cache  lines,  this  field  holds  the  node  containing  the  next  copy 
of  this  line.  If  no  copies  exist,  or  if  this  is  the  last  copy  in  the  chain,  this  slot 
contains  the  value  END. 

Status:  This  field  maintains  the  status  of  the  line.  A  chunk  line  status  can  be  one 
of  the  following:  UNLOCKED,  LOCKED.  READ-ONLY,  or  READ-WRITE. 
Cache  line  status  can  be  one  of  the  following:  INVALID.  READ-ONLY,  or 
READ- WRITE. 

Request:  This  field  is  used  by  the  cache  lines  to  serialize  cache  line  requests.  When 
a  cache  line  is  not  found,  a  memory  access  tries  to  acquire  this  field  of  the  cache 
line.  If  it  is  locked  (via  the  READ-ONLY  attribute),  the  request  suspends  until 
it  is  released.  Once  this  lock  is  obta'ned,  it  is  held  until  the  access  is  complete, 
guaranteeing  that  other  accesses  will  not  interfere. 

Barrier:  This  field  is  used  for  synchronization  between  the  cache  and  chunk  line 
during  an  access  operation.  It  is  implemented  as  a  b-sync. 

Two  map  segments  exist  on  each  node  to  translate  shared  memory  addresses  to  line 
segments.  The  chunk  map  is  a  read-write  segment  which  .laps  an  index  for  a  node's 
chunk  to  a  line  containing  the  data.  The  cache  map  is  an  cissociative  lookup  table 
keyed  on  shared  memory  addresses  (minus  the  line  offset).  It  translates  to  a  cache  line. 
If  the  line  status  is  READ-ONLY  or  READ-WRITE  on  a  read,  or  READ-WRITE 
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on  a  write,  the  cache  hits.  If  the  line  status  is  INVALID  for  a  read  or  INVALID  or 
READ-ONLY  for  a  write,  the  cache  misses. 

The  cache  map  must  be  safe  (i.e.,  entries  are  not  overwritten  by  other  insertions). 
Cache  maps  are  also  bound  in  size  to  control  the  amount  of  local  memory  used  by 
the  cache.  Because  of  these  two  facts,  a  read  or  write  operation  may  result  in  the 
replacement  of  the  current  cache  contents. 

Cache  lines  are  locked  via  the  request  field  of  a  cache  line.  Chunk  lines  are  locked 
using  the  status  field. 

3.2.3  The  Protocol 

The  shared  memory  scheme  implemented  in  this  example  is  defined  by  two  interacting 
protocols:  the  chunk  line  protocol  and  the  cache  line  protocol.  These  protocols  are 
presented  in  this  section. 

The  chunk  protocol  is  summarized  in  Figure  3.4.  It  defines  the  behavior  of  a  line 
of  chunk  memory.  There  are  three  normal  states:  Unlocked,  Read-Only,  and  Read- 
Write,  plus  several  locked  intermediate  states.  The  normal  states  have  the  following 
meaning: 

Unlocked  When  a  chunk  line  is  in  this  state,  there  are  no  cache  copies  on  any  other 
node.  If  an  access  request  is  made  to  a  line  in  this  state,  the  data  can  be 
furnished  immediately. 

Read-Only  This  state  indicates  that  one  or  more  read-only  cache  copies  exist  in 
other  nodes.  A  read  access  can  be  added  to  the  chain  of  linked  cached  copies. 
However  a  write  access  requires  that  all  read-only  cache  copies  be  invalidated 
first. 

Read-Write  This  state  indicates  that  a  single  read-write  cache  copy  is  contained 
by  another  node  in  the  system.  Any  access  to  this  line  must  be  preceded  by 
updating  the  chunk  with  the  state  of  the  exclusive  cache  copy. 


The  locked  states  in  the  protocol  are  intermediate  states  when  an  access  is  being 
completed.  When  in  these  states,  all  additional  accesses  are  blocked  until  the  current 
access  completes.  This  is  why  read  and  w’rite  messages  are  excluded  in  these  states. 
The  locked  states  have  the  following  meaning: 

Locked  Read  In  this  state,  a  read  access  is  in  progress.  This  state  is  entered  when 
the  line  data  hzis  been  replied  to  the  node  that  requested  the  line.  It  stays  in 
this  state  until  an  acknowledgment  is  received  from  the  requesting  node. 

Locked  Write  This  state  is  similar  to  Locked  Read,  except  an  exclusive  read-write 
copy  of  the  cache  has  been  replied  to  the  requesting  node. 
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Figure  3.4:  Chunk  Entry  Protocol 
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Locked  Invalidate  Write  When  a  write  access  is  received  and  the  current  state  is 
Read-Only,  all  read-only  copies  must  first  be  invalidated.  This  is  accomplished 
by  sending  an  invalidate  message  to  the  first  node  in  the  read-only  cache  chain. 
This  niessagc  is  forwarded  along  the  chain  until  all  copies  are  invalidated.  At 
this  point,  an  acknowledgment  message  is  returned  to  chunk  line,  changing  its 
state  to  Locked  Write. 

Locked  Flush  Read  When  a  read  access  is  received  and  the  current  state  is  Read- 
Write.  the  exclusive  cache  copy  must  first  be  flushed  (and  invalidated^)  before 
the  read-only  copy  can  be  replied  to  the  read  requester.  After  the  flush  message 
is  sent,  the  chunk  line  remains  in  this  state  until  a  flush  acknowledgment  message 
(with  the  updated  line  data)  is  received.  The  chunk  then  moves  to  the  Locked 
Read  state. 

Locked  Flush  Write  This  state  is  similar  to  Locked  Flush  Read,  except  a  write 
access  is  requested.  After  a  flush  acknowledgment  is  received,  the  chunk  line 
enters  the  Locked  Write  state. 

The  other  half  of  this  shared  memory  scheme  is  specified  in  the  cache  protocol  sum¬ 
marized  in  Figure  3.5.  It  defines  the  behavior  of  a  line  of  cache  memory.  In  the 
protocol,  accesses  are  separated  into  two  classes:  those  that  match  the  address  con¬ 
tained  in  the  luie,  and  those  that  don’t.  Reads  and  writes  that  match  are  printed 
in  italics.  Reads  and  writes  that  access  a  different  address  are  printed  in  boldface. 
This  protocol  includes  three  normal  states:  Invalid.  Read-Only,  and  Read-Write,  plus 
several  locked  intermediate  states.  The  normal  states  have  the  following  meaning: 

Invalid  When  a  cache  line  is  in  this  state,  the  cache  entry  is  empty.  The  empty 
line  continues  to  be  included  in  the  cache  map.  However,  it  can  be  acquired 
immediately  during  an  access. 

Read-Only  This  state  indicates  the  cached  copy  of  the  line  can  be  read  but  not 
written.  A  cache  line  in  this  state  results  in  read  hits  and  write  misses.  The 
cache  line  can  also  receive  an  invalidate  message,  which  causes  it  to  invalidate 
the  line  and  forward  the  request  to  the  next  node  in  the  cache,  or,  if  it  is  the 
last  line  in  the  chain,  it  sends  an  acknowledgment  to  the  chunk  line. 

Read-Write  In  this  state,  all  accesses  can  be  served  locally.  A  cache  line  in  this 
state  can  receive  a  flush  message  from  the  chunk  line.  Here,  it  sends  a  flush 
acknowledgment,  which  contains  the  line  update,  to  the  chunk  line.  The  cache 
line  then  enters  the  invalid  state. 

The  locked  states  of  the  cache  line  protocol  differ  from  those  of  the  chunk  line  protocol 
in  that  some  accesses  are  permitted  (not  blocked).  The  unlocked  states  have  the 
following  meaning: 

^Alternately,  the  Read-Write  cache  copy  could  be  converted  to  read-only  and  placed  on  the 
read-only  chain. 
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Figure  3.5:  Cache  Entry  Protocol 
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Locked  Read  In  this  state,  a  read  message  has  been  sent  to  the  chunk  line,  but  the 
replied  data  has  not  been  received.  This  state  implies  that  the  required  cache 
line  has  already  been  acquired.  In  this  state,  all  accesses  are  blocked  until  the 
request  data  is  received. 

Locked  Write  This  state  is  similar  to  Locked  Read,  except  a  write  message  has 
been  sent  to  the  chunk  line. 

Locked  Read-Only  Read  This  state  indicates  that  cache  line  contains  a  read-only 
copy,  but  different  read  accesses  require  the  same  cache  line.  A  remove  message 
has  been  sent  to  the  chunk  line  of  the  current  copy.  While  in  this  state,  read 
accesses  of  the  current  cached  line  can  complete.  All  other  accesses  block.  This 
state  is  left  when  an  invalidate  message  is  received  from  the  current  chunk  line. 

Locked  Read-Only  Write  This  state  is  similar  to  the  Locked  Read-Only  Read 
state  except  a  write  access  is  pending. 

Locked  Read- Write  Read  This  state  results  from  a  read  request  that  requires  a 
cache  line  currently  containing  an  exclusive  read-write  copy.  As  in  the  Locked 
Read-Only  states,  a  remove  message  has  been  sent  to  the  chunk  line  owning  the 
current  cache  line  contents.  In  this  state,  read  and  write  accesses  to  the  current 
cache  line  can  complete.  All  other  accesses  are  blocked. 

Locked  Read-Write  Write  This  state  is  sinndlar  to  Locked  Read- Write  Read  ex¬ 
cept  a  write  access  is  pending. 

In  both  the  chunk  and  cache  protocols,  messages  other  than  blocked  accesses  (read 
or  w’rite)  not  drawn  in  the  diagram  are  undefined  and  result  in  an  error.  A  complete 
cross-product  of  messages  and  states  is  presented  in  the  shared  memory  code  in 
Appendix  A. 

This  shared  memory  scheme  does  not  require  node  to  node  message  order  preservation 
on  the  network,  since  message  ordering  is  not  guaranteed  in  Pi.  Because  of  this,  both 
chunk  and  cache  lines  are  locked,  guaranteeing  exclusive  access.  This  creates  an 
additional  requirement  for  separate  instruction  and  data  caches,  since  a  deadlock 
situation  could  result  if  an  instruction  and  a  value  it  requests  require  the  same  cache 
line. 

A  virtual  shared  address  space  can  be  constructed  using  the  same  mechanisms  em¬ 
ployed  in  this  example.  A  page  field  is  added  to  the  address,  and  the  described  shared 
memory  scheme  serves  as  the  “physical”  memory.  When  the  page  misses  in  the  cache, 
and  it  is  not  present  in  physical  memory  (as  determined  by  the  page  table),  the  page 
is  fetched  from  disk.  Disks  appear  in  Pi  as  specially  designated  abstract  nodes  with 
larger  local  memories. 
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3.3  Set  Synchronization 


This  section  demonstrates  set  synchronization,  a  forni  of  intertask  control  synchro¬ 
nization  used  in  several  parallel  models  of  computation,  especially  data  parallelism. 
It  is  defined  as  follows: 

Consider  a  set  of  tasks  T\...Tm-  Each  task  71  contains  two  synchronization  events:  an 
enabling  event  Ei  and  a  dependent  event  /),.  A  set  synchronization  guarantees  that 
no  task  passes  its  dependent  event  until  all  tasks  have  reached  their  enabling  event. 
This  constraint  requires  that  Vi,  7),  >  Ej. 

Intuitively,  a  set  synchronization  keeps  a  set  of  related  tasks  “in  step”  within  the 
quantization  of  the  enabling  and  dependent  events.  After  all  tasks  have  reached  E, 
they  can  execute  independently  until  the  D  for  the  next  iteration  is  reached. 

Set  synchronization  is  often  used  for  data  parallel  iterative  algorithms.  In  these 
algorithms,  the  set  synchronization  guarantees  that  all  data  sets  are  executing  the 
same  iteration  of  a  procedure. 

The  perceived  importance  of  set  synchronization  varies  widely  across  the  parallel 
computing  community.  It  has  been  used  effectively  to  implement  many  applications, 
especially  ones  employing  data  parallelism.  These  applications  are  normally  executed 
on  machines  which  provide  a  fast  global  synchronization  mechanism  (e.g.,  SIMD 
machines  such  as  the  Connection  Max:hine  [56]). 

Some  argue  that  the  inclusion  of  a  hardware  global  synchronization  mechanism  is 
overly  restrictive,  particularly  when  synchronization  is  provided  at  each  instruction 
(SIMD).  Since  most  algorithms  require  less  frequent  synchronization,  program  execu¬ 
tion  is  unnecessarily  constrained.  The  technical  ramifications  of  hardware  supported 
global  synchronization  are  often  serious  because  of  the  required  communication.  Also 
multiple  independent  set  synchronizations  are  not  possible  on  these  machines.  Oth¬ 
ers  argue  that  global  synchronization  is  subsumed  by  data  synchronization  (e.g., 
dataflow).  However,  some  programming  models  employ  global  synchronization  to 
eliminate  the  need  for  data  dependency  analysis. 

A  global  synchronization  mechanism  is  not  explicitly  included  in  the  Pi  abstract 
machine.  Because  of  the  difficulties  and  liabilities  of  a  hardware  implementation  (e.g., 
long  cycle  times  for  global  signals  to  propagate,  restrictions  of  machine  scalability), 
global  synchronization  hardware  support  is  not  expected  in  machine  designs.  Instead, 
set  synchronization  (including  global  synchronization)  is  supported  on  top  of  the  Pi 
abstract  machine.  Synchronization  time  is  not  constant  as  in  SIMD  machines.  It 
is  a  function  of  (a)  the  log  of  the  number  of  elements  in  the  set,  (b)  the  distance 
between  elements  in  the  set.  and  (c)  message  traffic  (network  traffic  and  message 
queue  lengths).  While  this  overhead  may  be  greater  than  that  of  SIMD  machines,  it 
is  paid  only  at  required  synchronization  points.  The  set  synchronization  overhead  is 
partially  masked  if  tasks  perform  work  that  overlaps  the  synchronization  after  E  but 
before  D.  Pi  also  provides  simultaneous  multiple  set  synchronization  not  supported 
in  SIMD  machines. 
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Figure  3.6:  Set  Synchronization  Overview 

3.3.1  The  Components 

This  example  demonstrates  set  synchronization  in  Pi.  A  set  synchronization  is  used 
to  guarantee  lock  step  execution  of  iterations  of  a  procedure  on  several  nodes.  Imple¬ 
menting  a  real  application  using  set  synchronization  would  unnecessarily  complicate 
the  example  and  obscure  the  mechanism  being  illustrated.  Instead,  a  simple  count¬ 
ing  loop  serves  as  the  “work”  of  each  iteration.  However,  a  real  task  would  perform 
correctly  in  place  of  the  loop. 

Synchronization  is  accomplished  via  a  synchronization  tree  in  which  the  iteration 
tasks  are  leaves.  The  structural  elements,  or  stems,  distribute  start  signals  from  the 
root  and  combine  finish  signals  from  the  leaves.  Figure  3.6  illustrates  this  assembly. 

The  leaf  behavior  is  summarized  in  Figure  3.7.  It  begins  by  waiting  for  a  start  signal 
from  its  parent  (a  stem).  When  the  start  signal  arrives,  the  leaf  begins  execution  of  an 
iteration  of  a  procedure.  In  this  example,  this  ttisk  is  a  simple  counting  loop.  However 
a  more  complicated  procedure  could  be  undertaken  which  communicates  with  other 
leaves,  performs  computation  on  a  piece  of  a  distributed  data  set,  or  executes  run-time 
selected  sub-procedures.  When  a  task  reaches  a  defined  point  in  the  computation,  it 
sends  a  finished  signal  to  its  parent.  This  point  is  normally  at  the  end  of  the  iteration, 
although  a  task  may  perform  additional  computation  before  waiting  for  the  next  start 
signal. 

Stem  objects  form  the  structure  of  the  synchronization  tree.  Their  behavior  is  illus¬ 
trated  in  Figure  3.8.  Stems  distribute  and  combine  signals  between  the  leaves  and 
the  root.  A  stem  begins  waiting  for  a  start  signal  from  its  parent.  When  it  arrives, 
the  stem  forwards  the  signal  to  its  children.  An  s-sync  is  used  to  guarantee  that  each 
start  signal  is  relayed.  Then  the  stem  adjusts  a  b-sync  to  receive  the  expected  number 
of  finish  signals  coming  from  its  children.  The  stem  then  waits  for  all  finished  signals 
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Figure  3.7:  Leaf  Behavior 
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Figure  3.8:  Stem  Behavior 
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from  child 

Figure  3.9:  Root  Behavior 

to  be  received.  After  the  last  signal  is  received,  a  finished  signal  is  forwarded  to  the 
parent. 

The  execution  of  the  tree  is  orchestrated  by  the  root.  It  behavior  is  shown  in  Figure 
3.9.  It  starts  the  execution  of  an  iteration  by  sending  a  start  signal  to  the  top  stem.  It 
then  waits  for  a  finished  signal  indicating  that  all  leaves  have  completed  the  iteration. 
It  then  sends  a  start  signal  for  the  next  iteration. 

In  this  example,  data  is  not  transmitted  in  the  start  and  finish  signals.  However, 
this  feature  can  support  applications  requiring  broadcasted  data  for  each  iteration, 
or  data  combination  and  collection  to  the  root.  For  example,  the  finish  signal  may 
include  information  used  by  the  root  to  determine  the  data  broadccist  in  the  next 
start  signal. 

This  example  employs  a  general  tree  building  algorithm  to  create  an  optimal  tree  of 
a  specified  set  size.  The  stem  fanout  can  be  adjusted  to  achieve  the  desired  balance 
between  communication  delay  (tree  depth)  and  message  serialization  at  the  stems 
(branching  factor).  When  the  root  determines  that  the  final  iteration  has  been  com¬ 
pleted,  it  sends  a  special  end  signal.  When  this  signal  is  received,  each  element  of  the 
tree  deallocates  all  storage  and  ends  (i.e.,  the  algorithm  is  self  cleaning). 

3.4  Object  Name  Translation 

This  example  demonstrates  objecf*  name  translation.  This  mechanism  translates 
an  object  name  to  the  location  of  the  object  (e.g.,  a  node  number).  Several  name 
translation  mechanisms  are  employed  in  other  examples  in  this  chapter.  For  example, 
in  the  n-body  example  (described  later  in  this  chapter),  a  translation  table  for  all 
objects  is  maintained  on  a  single  node.  In  the  n-body  example,  the  communication 
patterns  are  static,  so  all  translations  are  performed  during  the  initialization  phase. 

^An  object  can  be  a  segment  or  task  (or  both). 
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Figure  3.10;  Translation  Overview 

Shared  memory  demonstrates  two  other  types  of  translation.  A  ll.ne  in  the  shared 
address  space  is  algorithmically  translated  to  the  node  number  where  it  resides.  When 
received  at  the  correct  node,  the  address  is  translated  to  a  segment  ID  via  a  table. 
The  shared  address  space  is  itself  a  translation  mechanism  of  addresses  to  values. 
It  supports  consistent  caching  so  translations  can  change  over  time.  However  the 
translation  data  itself  is  fixed.  Also,  name  allocation  is  restricted  by  a  rigid  connection 
between  logical  names  and  physical  addresses. 

In  the  n-body  and  shared  memory  examples,  objects  are  fixed.  Therefore  no  provision 
is  made  for  a  translation  to  change.  Once  a  translation  is  known,  it  assumed  to 
be  correct  forever.  The  translation  mechanism  provided  by  shared  address  space 
supports  an  important  additional  capability.  Objects  are  allowed  to  migrate.  Object 
migration  requires  that  the  translation  system  maintains  consistent  information  about 
the  locations  of  objects  even  as  the  objects  move  from  place  to  place.  However, 
guaranteeing  this  consistency  has  a  high  synchronization  and  communication  cost. 
Since  an  incorrect  object  translation  can  be  detected  easily,  a  less  expensive  technique 
can  be  employed. 

In  this  example,  a  logical  segment  namespace  is  supported  which  (a)  allows  segment 
migration,  and  (b)  the  segment  name  is  not  rigidly  connected  to  a  physical  address. 
The  first  feature  provides  more  flexible  storage,  task,  and  communication  manage¬ 
ment.  The  second  feature  supports  more  general  name  allocation  techniques,  and 
more  flexible  name  resolution. 
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3.4.1  The  Components 


The  three  components  in  this  system,  Maps,  Agents,  and  Clients,  are  shown  in  Figure 
3.10.  Maps  store  the  translations  in  the  system.  Each  map  is  a  small  associatively 
addressed  table  of  name/ value  pairs.  Maps  can  grow,  and  have  some  mechanism  for 
dividing  if  they  become  too  large.  However  this  division  is  an  internal  process.  A 
map  continues  to  be  seen  as  a  single  (named)  object  by  the  other  system  components. 
Maps  are  distributed  randomly  throughout  the  system. 

Agents  are  the  intermediaries  between  maps  and  clients.  An  agent  is  the  “front  end” 
to  the  distributed  collection  of  maps.  Each  agent  maintains  a  table  of  the  current 
map  locations.  When  a  translation  transaction  is  received,  the  agent  performs  a  hash 
operation  on  the  name  to  select  the  appropriate  map  to  receive  the  request.  The 
agent  then  forwards  the  transaction  to  the  map. 

Agents  are  connected  in  a  ring®  for  communicating  special  transactions  which  involve 
all  agents.  These  include  map  location  updates  (used  both  in  system  initialization 
and  when  maps  migrate)  and  client  requests  for  the  closest  agent.  Using  the  agent 
ring,  new  agents  can  be  created  when  one  agent  becomes  overloaded.  The  agent 
simply  creates  a  new  agent  “downstream”  in  the  ring  and  passes  a  copy  of  the  map 
location  tables  to  it.  It  can  then  redirect  clients  to  the  new  agent. 

Clients  are  the  end  users  of  the  translation  system.  Each  client  maintains  the  location 
of  a  nearby  agent.  All  translation  transactions  are  directed  to  that  agent.  A  client 
also  maintains  a  local  translation  cache  which  eliminates  message  requests  for  recently 
accessed  translations. 

Maps  and  agents  are  created  at  system  initialization  time.  Clients  are  created  as 
required  by  an  application.  A  client  is  created  on  the  same  node  as  the  application 
ttisk  needing  access  to  the  translation  system.  More  than  one  client  may  be  created 
on  a  node  to  improve  cache  performance. 

When  a  new  client  is  created,  it  is  given  a  prototype  agent.  This  can  be  any  agent 
in  the  system.  As  part  of  the  client’s  initialization  procedure,  it  requests  the  nearest 
agent  from  the  prototype  agent.  This  way,  a  nearby  agent  is  acquired,  reducing 
network  traffic  and  transaction  latency. 

In  the  example  system,  map  division,  map  migration,  and  agent  spawning  are  not 
implemented.  However,  their  implementation  is  straightforward. 

This  translation  system  is  designed  to  maintain  object  name  -+  object  location  trans¬ 
lations.  Since  some  models  allow  object  migration,  translation  bindings  can  change 
over  time.  As  a  result,  translations  are  not  guaranteed. 

There  are  two  circumstances  which  result  in  incorrect  translation.  Since  client  caches 
are  not  updated  when  a  translation  is  changed,  they  can  contain  stale  data  which  is 
incorrect.  Or,  a  transaction  can  get  a  stale  translation  from  the  map  (a  new  location 
arrives  after  a  location  request). 

*If  the  number  of  agents  is  large,  a  lower  diameter  connection  structure  can  more  efficiently 
provide  this  function  (e.g.,  a  tree). 
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Because  of  the  possibility  of  incorrect  translations  being  supplied,  the  system  that 
uses  the  information  (e.g.,  message  delivery)  must  be  capable  of  detecting  errors.  One 
error  response  is  to  return  the  message  to  the  sender.  The  sender  then  requests  the 
trcinslation  again,  this  time  without  checking  the  client  cache.  The  returned  value 
updates  the  client  cache,  overwriting  a  possibly  stale  entry. 

In  this  implementation,  the  translation  system  supplies  the  node  where  the  object 
(a  segment)  resides.  The  message  is  then  sent  to  that  node  along  with  the  object 
name.  That  node  performs  a  second  translation  to  obtain  the  segment  number.  This 
is  the  mechanism  which  detects  incorrect  translations.  If  an  object  is  moved  within 
the  node  memory,  but  remains  on  the  node,  the  distributed  translation  address  does 
not  change.  Updates  only  result  from  internode  migration. 


3.5  Non-Resident  Handlers 

PiSim  assumes  that  the  code  for  all  handlers  resides  on  each  node.  For  many  models 
of  computation  (e.g.,  concurrent  object  oriented),  this  is  impractical.  There  are  two 
classes  of  handlers  supported  in  Pi:  resident  and  non-resident.  Resident  handlers  are 
small  in  number,  and  “wired  down”  on  every  node  (i.e.,  they  are  always  present). 
Non-resident  handlers  are  installed  on  a  node  as  they  are  required.  After  they  have 
been  used,  they  can  be  removed. 

This  example  demonstrates  support  for  non-resident  handlers.  In  this  scheme,  a 
single,  reference  copy  of  each  handler  (a  reference  handler)  resides  on  one  node  in 
the  system.  When  a  handler  is  required  to  handle  a  message  on  a  node,  a  copy  of 
the  reference  handler  is  obtained  first.  Handlers  are  cached  on  the  nodes  to  reduce 
handler  requests  and  improve  handler  dispatch  time.  Since  handlers  are  immutable, 
a  cache  consistency  scheme  is  not  required. 


3.5.1  The  Components 

An  overview  of  the  example  is  shown  in  Figure  3.11.  It  is  composed  on  five  pieces:  the 
reference  handler  map,  the  current  handler  map,  the  translation  client,  the  requested 
handler  set,  and  the  dispatch  handler. 

Reference  Handler  Map  One  reference  map  resides  on  each  node.  It  includes 
all  reference  handlers  that  are  maintained  on  that  node.  It  is  associatively 
addressed  using  the  non-resident  handler’s  name. 

Current  Handler  Map  Each  node  also  contains  a  current  handler  map.  It  includes 
all  cached  handlers  on  that  node.  The  size  of  this  map  can  be  bound  to  prevent 
cached  handler  storage  from  growing  too  large.  It  also  is  associatively  addressed 
using  the  handler’s  name. 
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Figure  3.11;  Non-Resident  Handler  Overview 


Translation  Client  This  non-resident  handler  example  uses  the  translation  mech¬ 
anisms  described  in  the  last  example  to  determine  the  location  of  a  required 
reference  handler.  The  translation  system  translates  a  non-resident  handler 
name  to  the  node  ID  on  which  it  resides. 

Requested  Handler  Set  Each  node  contains  a  requested  handler  set  which  main¬ 
tains  information  about  which  handlers  have  already  been  requested.  It  also 
provides  access  to  the  synchronization  object  which  requeues  dispatches  when 
the  handler  arrives. 

Dispatch  Handler  This  is  a  resident  (wired  down)  handler  which  pr.>vides  access 
to  the  non-resident  handler  mechanism. 


The  non-resident  handler  dispatch  procedure  is  shown  in  Figure  3.12.  This  example 
demonstrates  variable  argument  passing  in  Pi.  It  also  shows  how  synchronizations  can 
be  combined  with  associative  sets  (synchronization  objects  are  stored  in  the  requested 
handler  set  for  non-busy  waiting). 

The  next  two  sections  consider  collections  of  several  model  mechanisms  used  to  solve 
two  parallel  scientific  applications:  n-body  and  relaxation. 


3.6  N-Body  Simulation 

This  example  simulates  the  mutual  interaction  of  N  bodies  in  the  absence  of  exter¬ 
nal  forces.  Algorithmically,  this  is  a  straightforward  implementation  which  does  not 
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Figure  3.12;  Dispatch  Procedure 
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Figure  3.13:  N-Body  Overview 

incorporate  recent  approaches  to  the  problem  (e.g.,  Greengard  [24],  Zhao  [59]).  How¬ 
ever,  the  application  demonstrates  several  fundamental  model  mechanisms  required 
by  static  and  dynamic  dataflow,  concurrent  object  oriented,  and  CSP  models.  The 
program  simulates  the  two  dimensional  problem,  but  is  easily  generalized  to  three 
dimensions. 

Normally,  applications  would  not  be  writing  directly  in  Pi.  This  example,  and  the 
relaxation  example  which  follows,  demonstrate  how  application  requirements  are  met 
with  Pi.  A  language  compiler  would  ease  the  programming  and  debugging  teisk. 

3.6.1  The  Problem 

The  n-body  example  is  illustrated  in  Figure  3.13.  Consider  N  bodies  which  are 
randomly  positioned  in  a  plane.  Each  pair  of  bodies  is  acted  on  by  a  mutual  force 
between  them,  which  is  proportional  to  some  characteristic  of  the  bodies  (e.g.,  the 
body's  mass)  and  the  distance  between  them  (e.g.,  an  inverse  square  relationship). 
In  this  example,  gravitational  attraction  is  modeled.  The  force  between  two  bodies 
is  expressed  by  the  equation; 

|(X,-X,)|3 

If  there  are  N  bodies  in  the  system,  there  are  (TV  *  (TV  —  l)/2)  interactions.  For  this 
example,  the  ma^s  of  each  object  is: 

ObjectMass  =  BaseMass  4-  {ID  ■  DeltaMass) 

where  ID  is  the  object  number,  assigned  0 ...  (TV  —  1 ),  and  base  and  delta  masses  are 
parameters  of  the  system. 


52 


3.6.2  The  Components 

The  simulation  is  composed  of  two  basic  types  of  objects;  bodies  and  interactions. 
Body  objects  maintain  the  ID,  position,  velocity,  and  net  acceleration  of  the  body, 
and  pointers  to  the  —  1  appropriate  interactions.  Interactions  maintain  the  dis¬ 
tances  and  masses  of  the  two  bodies  participating  in  the  interaction,  and  compute  the 
net  force  and  resulting  acceleration  components.  Interaction  objects  also  maintain 
pointers  to  the  two  appropriate  body  objects. 

The  first  interesting  model  mechanism  demonstrated  by  this  example  occurs  in  the 
system  initialization  handler  (the  code  is  provided  in  Appendix  A).  Here,  the  handler 
must  interconnect  newly  created  body  and  interaction  objects  which  are  randomly 
placed  on  nodes  in  the  system.  Although  a  clever  node  encoding  scheme  could  reduce 
network  traffic  and  simplify  interconnect,  such  a  scheme  was  already  demonstrated 
in  the  shared  memory  example.  An  alternative  technique  is  used. 

The  startup  handler  creates  two  segments  to  hold  the  newly  created  node  object 
locations  (node  and  segment  ID).  As  the  node  creation  messages  are  sent,  the  des¬ 
tination  node  ID  is  stored  in  the  node  segment.  But  the  object  segment  ID  must 
be  returned  when  the  object  is  instantiated  on  the  remote  node.  After  the  startup 
handler  sends  out  the  body  initialization  messages,  the  are  followed  by  the  interaction 
initialization  messages.  These  messages  include  the  locations  of  the  bodies.  However, 
since  the  segments  containing  body  locations  are  initialized  as  d-syncs,  no  explicit 
synchronization  is  required  between  the  two  processes.  This  demonstrates  simple 
data  synchronization,  which  is  required  in  nearly  every  model. 

When  the  body  and  interaction  initialization  messages  are  received  on  the  selected 
node,  object  storage  is  allocated  and  initialized  "dth  the  formals  as  the  message  is 
removed  from  the  network  (i.e.,  the  message  storage  is  the  task  storage).  Storage 
requirements  which  are  not  known  at  compile  time  (e.g.,  interaction  location  storage 
maintained  by  bodies)  is  allocated  explicitly. 

When  body  and  interaction  objects  are  created,  d-syncs  support  rendezvous  synchro¬ 
nization.  Interaction  objects  send  their  locations  directly  to  the  appropriate  body 
objects.  When  these  locations  are  written  on  the  body  objects,  linkage  is  complete. 

The  body  and  interaction  initialization  handlers  include  the  main  simulation  loop. 
Therefore  no  additional  task  invocations  (other  than  the  communication  messages) 
are  required. 

The  behavior  of  a  body  is  described  in  Figure  3.14.  It  begins  by  sending  its  position 
to  each  interaction  in  which  its  involved.  It  then  waits  for  all  acceleration  components 
from  the  interactions.  It  calculates  a  new  velocity  and  position.  Then,  if  the  final 
iteration  count  hcis  not  been  reached,  it  begins  another  iteration. 

•An  interaction’s  task  is  shown  in  Figure  3.15.  It  calculates  the  distance  and  force 
between  two  objects.  Then  it  distributes  acceleration  components  to  the  two  bodies. 
Acceleration  rather  than  force  components  are  computed  since  the  interaction  requires 
the  body  m£isses  for  computing  the  interaction  force.  Body  objects  do  not  maintain 
their  mass.  An  interaction  also  tests  the  iteration  count  to  detect  termination. 
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Figure  3.14:  Body  Behavior 
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Figure  3.15:  Interaction  Behavior 
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Figure  3.16;  Relaxation  Overview 


Two  synchronizations  are  required  for  these  interaction  loops.  The  body  routine 
requires  a  barrier  synchronization  (b-sync)  to  determine  when  all  acceleration  com¬ 
ponents  are  received.  Before  the  position  updates  are  sent,  the  b-sync  is  set  to  the 
number  of  interactions.  When  all  updates  are  sent,  the  b-sync  count  is  tested.  This 
typically  results  in  suspension  of  the  body  loop.  When  an  acceleration  component  is 
received,  the  barrier  is  reduced  by  one.  The  last  component  requeues  the  body  loop 
to  continue. 

The  interaction  loop  employs  an  s-sync  to  guarantee  that  each  position  update  is  used 
exactly  once.  If  the  data  has  not  been  received,  the  interaction  loop  suspends  until 
it  is  written.  If  the  position  data  has  not  been  read  by  the  loop,  the  update  handler 
which  writes  the  value  suspends  until  it  has  been  read.  Access  is  synchronized  on  a 
per  datum  bases,  avoiding  unnecessary  synchronizations  between  objects. 

S-syncs  can  be  used  in  static  dataflow  to  eliminate  acknowledgments  without  placing 
restrictions  on  dataflow  graphs.  Only  flow  control  messages  are  required  to  avoid 
buffer  overflow. 

When  both  loops  end,  all  storage  is  reclaimed. 


3.7  Relaxation 

This  example  uses  a  relaxation  algorithm  to  computes  the  thermal  equilibrium  of  a 
rectangular  plate  in  contact  with  four  heat  sources  at  different  temperatures.  The 
plate  is  divided  into  an  array  of  elements.  Each  element  performs  several  iterations 
where  it  computes  its  new  temperature  as  the  average  of  its  four  neighbors. 


This  is  a  simple  relaxation  algorithm.  A  more  sophisticated  multi-grid  technique 
(e.g.,  Mol  [42])  is  more  suitable  for  real  applications. 


3.7.1  The  Components 

The  number  of  elements  created  in  the  simulation  is  specified  by  constants  in  the 
program.  Figure  3.16  illustrates  the  layout  for  the  10  by  10  plates  simulated  in  this 
example.  Each  element  is  identified  by  its  index  (the  number  in  the  element  box). 

A  single  task  constructs  all  the  elements  by  sending  a  start  message  to  different  nodes. 
As  in  the  shared  memory  example,  node  numbers  are  used  as  a  naming  scheme,  and 
as  a  way  to  spread  out  the  computation.  An  element  resides  on  node  N  where  N  is 
its  index  modulo  the  number  of  nodes  used  in  the  example.  An  element  computes  the 
address  of  its  neighbors  using  the  array  sizes.  It  also  determines  whether  it  borders 
on  the  plate  boundaries. 

Unlike  the  n-body  example,  an  element  never  gets  a  pointer  to  its  neighboring  ele¬ 
ment’s  segments.  Here,  the  creating  task  would  be  swamped  by  neighbor  segment 
requests.  Instead,  an  element  map  is  created  on  each  node.  It  performs  translations 
between  index  name  and  elements  residing  on  that  node.  So  an  element  reaches  its 
neighbors  by  computing  the  neighbor’s  node  and  sending  a  message  there.  When  the 
message  arrives,  the  destination’s  name  is  translated  using  the  node’s  element  map 
to  the  correct  segment. 

Once  all  elements  are  initialized.  ?y  begin  the  first  iteration  of  the  relaxation.  This 
procedure  is  shown  in  Figure  3.17.  Each  element  executes  a  specified  number  of 
iterations,  then  returns  its  final  temperature  to  the  creating  task  for  display. 

This  scheme  demonstrates  the  utility  of  s-syncs  in  regulating  producer/consumer 
synchronization.  Each  element  includes  an  s-sync  for  each  neighbor.  The  neighbors 
produce  temperature  updates  which  write  these  values.  The  element  consumes  the 
values  when  calculating  its  new  temperature. 

Because  of  this  algorithm’s  design,  an  s-sync  never  has  more  than  one  value  presented 
to  be  written  before  a  read  is  executed.  This  is  because  of  the  data  dependency 
cycle  between  neighboring  cells.  In  producer-consumer  situation  where  the  producer’s 
output  rate  is  independent  from  the  consumer’s  input  rate,  it  is  possible  to  have 
multiple  tasks  waiting  to  write  an  s-sync  with  a  produced  value.  Since  the  network 
does  not  preserve  message  order,  some  form  of  message  sequence  numbers  is  required 
to  prevent  produced  values  from  being  written  out  of  order.  When  the  producer 
generates  a  value,  it  also  assigns  it  a  sequence  number.  Before  these  produced  values 
are  written  to  an  s-sync,  the  value  sequence  number  is  first  tested  to  see  if  it  is  the 
next  produced  value. 

This  scheme  presents  a  problem:  how  does  an  element  know  when  its  neighbors  have 
been  initialized?  This  example  employs  a  spin-lock  technique  as  follows.  When  a 
temperature  update  arrives  on  a  node,  the  element  index  is  referenced  in  the  element 
map.  If  it  is  not  found,  the  task  suspends,  then  tries  again.  This  allows  other  tasks  to 
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Figure  3.17;  Element  Procedure 
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execute  on  the  node  before  the  next  query.  When  the  element  creation  task  installs 
the  requested  element  in  the  map,  the  update  task  accesses  it  and  completes. 

The  use  of  a  synchronization  object  in  the  associative  set,  used  for  non-resident  han¬ 
dlers,  could  also  be  used  here  instead  spin-locking.  Since  the  synchronization  object 
requires  no  busy  waiting,  it  is  computationally  more  efficient.  However,  when  a  short 
wait  is  expected,  the  simplicity  of  the  spin-locking  is  sometimes  appropriate. 

This  scheme  does  not  include  a  mechanism  to  determine  when  all  element  temperature 
changes  are  within  a  specified  epsilon.  There  is  a  straightforward  implementation 
using  the  combining  tree  from  the  set  synchronization  example.  This  tree  does  not 
serve  to  synchronize  the  iterations.  Instead,  it  periodically  (e.g.,  every  J  iterations) 
combines  temperature  changes  from  all  elements  to  determine  if  all  are  within  the 
specified  epsilon.  If  so,  the  tree  distributes  a  stop  iterating  message  to  all  elements. 
The  combining  tree  operates  independently  from  the  element  iterations. 


3.8  Summary 

This  chapter  demonstrates  how  mechanisms  from  different  computational  models  are 
represented  in  Pi.  One  of  the  interface  evaluation  metrics  described  in  Chapter  1 
is  functionality  match.  The  model  mechanisms  are  directly  implemented  using  Pi 
operations  (i.e.,  there  is  a  good  functional  match  between  Pi  operations  and  the 
model  requirements).  Figure  3.18  shows  the  parallel  issues  are  best  demonstrated  by 
the  examples.  It  also  shows  model  or  models  which  are  most  similar. 

In  the  next  chapter,  the  effectiveness  of  these  examples  is  considered.  Their  cost 
(in  terms  of  basic  parallel  operations)  is  examined  and  their  execution  behavior  is 
characterized. 
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Chapter  4 

An  Evaluation  of  Pi 


This  chapter  evaluates  the  model  mechanisms  and  applications  described  in  Chapter 
3.  Features  of  Pi  that  support  the  examples  are  discussed  and  some  key  code  frag¬ 
ments  are  considered.  Results  collected  during  the  execution  experiments  are  also 
presented  and  evaluated.  First,  the  evaluation  metrics  are  defined. 


4,1  Evaluation  Metrics 

One  way  to  evaluate  a  mechanism  implementation  is  by  examining  the  Pi  code.  Pi 
is  not  an  abstract  programming  language.  In  a  Pi  program,  parallel  operations  are 
explicit.  By  inspecting  the  code  that  implements  a  mechanism,  one  can  estimate  its 
cost  in  terms  of  parallel  operations  (e.g.,  communications,  synchronizations,  etc.). 

In  this  chapter,  operations  denote  Pi  operations  such  as  read,  self,  and  attribute.  Usu¬ 
ally,  many  operations  are  combined  to  form  a  single  logical  instruction.  For  example, 
the  Pi  code  fragment: 

(write  (self)  C  (plus  (read  (self)  A)  (read  (self)  B))) 

contains  seven  operations  (one  write,  one  plus,  two  reads,  and  three  seifs).  Yet  this  is 
considered  a  single  instruction  since  adding  two  values  is  considered  a  single  logical 
step.  Logical  instructions  are  a  more  meaningful  metric  when  considering  costs  on  a 
Pi  substrate.  They  are  used  as  the  main  unit  of  evaluation  rather  than  Pi  operations. 

Other  characteristics  about  the  examples  are  gathered  by  executing  them.  PiSim  is 
instrumented  to  collect  several  statistics: 

Total  Tasks  Executed  This  is  the  total  number  of  tasks  executed  during  a  trial. 
It  gives  a  measure  of  the  size  of  the  experiment.  Tasks  are  created  by  the  send, 
send-segment,  call,  and  call-segment  operations. 

Total  Instructions  Executed  This  is  the  total  number  of  logical  instructions  exe¬ 
cuted  during  a  trial.  This  is  another  measure  of  the  experiment  size. 
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Average  Instructions/Task  This  is  the  average  total  number  of  logical  instruc¬ 
tions  executed  by  each  task  (i.e.,  the  total  number  executed  independent  of 
task  suspensions).  This  metric  measures  task  grain  size. 

Average  Instruction  Run  Length  This  is  the  average  number  of  logical  instruc¬ 
tions  executed  by  a  task  between  suspensions.  For  example,  if  a  new  task 
executes  six  instructions,  suspends,  then  resumes  and  executes  four  more  in¬ 
structions  before  completing,  its  average  instruction  run  length  is  five  (ten  in¬ 
structions  /  two  running  periods).  This  metric  measures  execution  grain  size. 

Average  Operations/Instruction  This  is  the  average  ratio  of  Pi  operations  to  log¬ 
ical  instructions.  It  measures  the  average  complexity  of  the  logical  operations. 

Total  Read-Write  Segments  Created  This  is  the  total  number  of  read- write  seg¬ 
ments  created  during  a  trial.  This  parameter  includes  both  task  segments  re¬ 
sulting  from  calls  and  sends,  and  explicitly  created  data  segments.  This  metric 
measures  storage  required  by  a  trial. 

Average  Read- Write  Segment  Length  This  is  the  average  size  of  a  read-write 
segment  allocated  during  the  experiment.  It  includes  both  data-only  and  task 
segments.  This  metric  measures  the  storage  grain  size. 

PiSim  also  collects  instruction  type,  operation  type,  segment  size,  segment  age,  and 
task  run  and  wait  time  histograms.  A  detailed  log  of  each  experiment  is  provided  in 
Appendix  B. 

4.2  Shared  Memory  with  Caches 

The  definition  and  implementation  of  the  shared  memory  protocol  in  Pi  is  similar  to 
those  proposed  for  specialized  shared  memory  machines  [2,  9,  11,  12,  22,  35,  36].  The 
sequences  for  supporting  the  protocol  employ  the  same  communication  and  synchro¬ 
nization  operations. 

When  evaluating  Pi’s  implementation,  special  issues  that  must  be  addressed  include 
(a)  critical  sequences  in  the  protocol  where  machine  specialization  can  improve  the 
performance  (e.g.,  for  cache  read  hits),  and  (b)  'otocol  differences  resulting  from 
characteristics  of  machine  hardware  (e.g.,  message  order  preservation).  This  section 
discusses  these  issues. 

If  the  cache  hit  rate  is  high,  the  overhead  for  accessing  a  cache  hit  is  important  for 
memory  access  performance.  Figure  4.1  gives  the  code  for  a  shared  memory  read. 
On  a  cache  hit,  the  following  logical  instructions  are  executed:  two  and,  one  match, 
two  branch,  and  one  read.  So,  a  total  of  six  logical  instructions  are  executed  when  a 
cached  value  is  read. 

Above  the  interface,  this  cost  can  be  reduced  to  two  logical  instructions  (one  match 
and  one  branch)  by  adding  each  shared  memory  word  to  the  cache  separately.  Larger 
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;;  This  handler  executes  a  “read”  code  fragment. 

(def ine-handler  Read-Address  (Address)  (Base  Offset  Line) 

(vrite  (self)  Base  (and  (read  (self)  Address)  Base-Mask)) 

(write  (self)  Offset  (and  (read  (self)  Address)  Of f set -Mask) ) 

(write  (self)  Line  (match  (read  (nodals)  Cache-Map)  (read  (self)  Base))) 
(branch-zero  (compare  (read  (self)  Line)  UKBOUHD)  Get-Line) 

(branch-zero  (comp2u:e  (read  (read  (self)  Line)  Status-Offset)  INVALID) 
Get-Line) 

(return  (read  (read  (self)  Line)  (read  (self)  Offset))) 

(destroy-segment  (self)) 

Get-Line 

(write  (self)  Line  (call  8  Get-Cache-Line  (read  (self)  Base)  READ)) 
Read-Line 

(return  (read  (read  (self)  Line)  (read  (self)  Offset))) 

(attribute  (read  (self)  Line)  Request-Offset  READ-WRITE) 

(destroy-segment  (self))) 


Figure  4.1:  Read  Access  Code  Fragment 

lines  (e.g.,  eight  words)  can  still  be  requested  over  the  network  to  provide  spatial 
locality. 

Below  the  interface,  the  cost  of  supporting  read  hits  can  be  reduced  if  segment  caches 
are  provided  in  the  hardware  substrate.  Segment  caches  are  fast,  associatively  ad¬ 
dressed  read-write  segment  buffers  that  provide  very  low  latency  accesses.  The  ma¬ 
chine  dependent  compiler  can  transform  the  Pi  code  sequence  into  segment  cache 
references  that  provide  comparable  access  times  to  that  of  specialized  shared  memory 
machines.  In  addition,  segment  caches  also  improve  the  performance  of  other  model 
mechanisms. 

A  Pi  implementation  of  shared  memory  must  deal  with  two  issues  not  normally  ad¬ 
dressed  in  other  implementations.  First,  since  Pi  does  not  preserve  message  order 
between  nodes,  special  handling  is  required  to  guarantee  consistency.  Chunk  lines 
must  be  locked  for  a  longer  period,  and  an  extra  acknowledgment  must  be  sent  from 
the  cache  to  the  chunk. 

The  second  additional  issue  is  related  to  chunk  request  wait  time.  Since  Pi  does  not 
allow  tasks  to  be  interrupted  when  a  message  is  received,  a  shared  memory  access 
request  (e.g.,  a  chunk  request)  must  wait  until  the  tcisk  executing  on  the  node  suspends 
or  completes  before  it  can  be  serviced.  Prioritized  task  queues  allow  memory  requests 
to  execute  ahead  of  application  tasks.  However,  if  the  run  length  of  the  application 
task  is  long,  the  access  request  wait  time  can  become  unacceptably  long.  To  prevent 
this,  the  machine  compiler  must  insert  voluntary  suspend  operations  in  long  execution 
sequences  to  reduce  the  run  length.  If  the  suspension/  resumption  overhead  is  low, 
this  does  not  significantly  affect  execution  performance. 

A  non-caching  protocol  that  incorporates  address  braiding  is  also  included  in  Ap¬ 
pendix  A.  Braiding  requires  three  logical  instructions  to  decode  (unbraid)  the  address. 
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total  tasks  executed 

33,004 

total  instructions  executed 

329,359 

average  instructions/task 

10.0 

average  instruction  run  length 

8.1 

average  operations  per  instruction 

4.6 

total  read-write  segments  created 

33,770 

average  read-write  segment  length 

6.9 

Table  4.1:  Shared  Memory  Statistics 
but  these  operations  are  only  executed  on  cache  misses. 

4.2.1  The  Experiment 

To  demonstrate  the  shared  memory  example,  several  tests  were  conducted  on  a  4096 
word  shared  memory^.  The  tests  execute  sequences  of  local  and  remote  read  accesses 
that  exercise  the  protocol.  For  example,  one  test  writes  a  location  locally,  then  reads 
it  back  remotely  to  verify  that  its  read- write  cache  copy  was  flushed  properly.  Another 
test  writes  and  tests  each  memory  location  in  the  shared  address  space  to  evaluate  the 
protocol  in  a  high  request  traffic  situation.  The  shared  memory  tests  do  not  provide 
“typical”  access  patterns  since  they  are  primarily  testing  “miss”  behavior. 

Table  4.1  presents  statistics  from  the  experiment.  This  is  the  largest  model  mechanism 
example,  in  terms  of  number  of  instructions.  Because  some  shared  memory  tests  are 
very  long,  this  example  has  the  second  largest  instruction  per  task  ratio.  Yet  in  spite 
of  this,  the  experiment  has  a  run  length  of  8.1.  Except  for  one  atypical  example,  all 
experiments  in  this  chapter  have  a  run  length  less  than  ten.  This  demonstrates  Pi 
ability  to  represent  fine-grain  tasks. 

This  example  also  allocated  a  large  number  of  segments.  Most  are  task  segments 
created  by  send  and  call  (33,004  of  the  33,770  segments).  Of  766  segments  created 
explicitly,  637  were  12  word  line  objects  (512  for  chunk  storage,  125  for  cache).  These 
segments  represent  only  two  percent  of  the  segments  created.  However,  since  these 
segments  are  long-lived,  whereas  task  segments  are  typically  ephemeral,  they  consti¬ 
tute  a  significant  component  of  the  storage  requirement  for  this  example. 

This  example  presents  a  shared  memory  scheme  constructed  with  Pi.  This  implemen¬ 
tation  is  non-optimal,  but  it  demonstrates  the  necessary  mechanisms.  Future  research 
will  examine  particular  tradeoffs  in  the  protocol  specification  using  trace  data  from 
concurrent  applications. 

‘This  size  was  selected  to  allow  experimentation  on  a  sequential  Pi  simulator  (PiSim).  The 
overhead  costs  are  the  same  for  any  shared  memory  size  provided  the  euldresses  can  be  contained  in 
a  single  word. 
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A  shared  memory  location  is  more  costly,  in  terms  of  storage  and  access  time,  than 
a  normal  segment  location.  This  reflects  the  extra  cost  of  maintaining  global  consis¬ 
tency.  However,  an  efficient  shared  memory  protocol  implemented  on  an  appropriate 
Pi  substrate  can  compete  with  the  performance  of  specialized  shared  memory  archi¬ 
tectures. 


4.3  Set  Synchronization 

In  Pi,  a  set  synchronization  delay  depends  on  (a)  the  log  of  the  number  of  elements  in 
the  set.  (b)  the  distance  betw’een  elements  in  the  set,  and  (c)  message  traffic  (network 
traffic  and  message  queue  lengths).  A  machine  that  is  specialized  to  support  set 
synchronizations  can  reduce  the  cost  of  these  factors  in  several  ways. 

One  approach  is  to  distribute  global  synchronization  information  as  a  wired-or  signal. 
The  delay  is  still  dependent  on  the  separation  of  the  set  elements  (here,  the  size  of 
the  mac.iine),  but  the  constant  factor  is  smaller,  since  dedicated  wires  and  control 
logic  is  used.  But  these  resources  are  idle,  unless  a  synchronization  is  taking  place. 
Also,  this  approach  can  only  support  one  global  synchronization. 

A  more  general  approach  is  to  dedicate  special  communication  channels  for  syn¬ 
chronization  information  and  add  hardware  to  reducing  the  overhead  of  distributing 
communication  information.  This  supports  multiple  set  synchronizations  with  low 
overhead.  But  like  the  previous  approach,  the  resources  used  to  make  se  .  synchro¬ 
nization  fast  are  taken  away  from  other  functions  (especially  data  communication). 

Pi  uses  the  general  communication  mechanism  send,  to  support  set  synchronizations. 
Special  hardware  is  provided  on  each  node  to  provide  fast  set  synchronization  tasks. 
This  hardware  also  improves  other  features  of  Pi,  making  it  generally  more  valuable. 

The  cost  of  performing  a  set  synchronization  in  Pi  can  be  computed  from  the  code 
in  Appendix  A.  If  F  is  the  fanout  and  N  is  the  number  of  leaves,  the  cost  for  a 
single  synchronization  (not  including  tree  initialization,  which  is  amortized)  can  be 
determined. 

The  root  does  a  send,  adjusts  a  barrier,  and  waits  for  the  FINISH  signal  to  arrive. 
When  it  does,  the  barrier  is  adjusted,  and  the  signal  segment  is  destroyed.  The  total 
cost  is  one  send,  two  adjust-count,  one  test-count,  and  one  destroy-segment. 

.After  a  START  signal  is  received  (one  s-sync  write  and  one  destroy-segment),  the  stem 
forwards  F  copies^  of  the  START  signal  to  its  children  (one  initialization,  one  s-sync 
read.  F  send,  plus,  and  branch  instructions).  It  then  waits  for  N  FINISH  signals  before 
sending  a  FINISH  signal  to  its  parent  (two  branch,  F-|- 1  adjust-count,  one  test-count, 
F  destroy-segment,  and  one  send  instructions). 

Each  leaf  receives  a  START  signal  (one  s-sync  write  and  one  destroy-segment  j,  executes 
the  task,  sends  a  FINISH  signal  and  waits  for  the  next  signal  (one  send,  one  branch, 
and  one 

^Depending  on  N  and  F,  a  few  stems  have  less  than  F  children.  Therefore  this  is  the  worst  case. 
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Table  4.2:  Set  Synchronization  Costs 

Main-Loop 

;  ;  This  is  the  sectio.  of  iteration  sork  that  c2mnot  be  overlapped, 
(write  (self)  Index  (minus  (read  (self)  Index)  1)) 

(branch-not-zero  (read  (self)  Index)  Main-Loop) 

; ;  The  finish  signal  is  sent  to  the  parent  stem. 

(send  (read  (self)  Up-Mode)  6  NORMAL  Adjust-Bcorier 

-1  (read  (self)  Up-Segment)  DOWN-BARRIER-OFFSET) 

; ;  This  is  the  section  of  the  iteration  work  that  can  be  overlapped. 

(write  (self)  Index  WORK-SIZE) 

; ;  The  leaf  waits  to  begin  the  next  iteration. 

(write  (self)  Temp  (read  (self)  Up-Sync))  ;;  wait  for  start  signal 
(branch-not-zero  (compare  (read  (self)  Temp)  END)  Main-Loop) 


Figure  4.2:  Leaf  Task  Fragment 

s-sync  read). 

The  total  iteration  cost  for  each  object  is  summarized  in  Table  4.2.  For  a  given  ^ 
and  jV,  the  cost  of  an  iteration  can  be  computed  by  multiplying  the  number  of  each 
type  of  object  by  the  cost  of  that  object  and  summing  the  result.  For  example,  if 
F  =  b  and  N  =  100,  the  synchronization  tree  contains  one  root,  25  stems,  and  100 
leaves.  For  this  tree,  the  overhead  cost  of  one  synchronization  is  listed  in  Table  4.2. 

Strict  synchronization  is  not  always  necessary.  By  overlapping  non-synchronized  code 
during  the  signaling  phase,  the  co.-',  of  synchronization  can  be  reduced.  For  exam¬ 
ple,  if  a  leaf  task  can  do  part  of  the  iteration  work  (that  can  be  overlapped)  after 
sending  the  FINISH  signal  to  the  stem,  the  time  of  that  work  is  subtracted  from  the 
synchronization  cost  (see  Figure  4.2). 

4.3.1  The  Experiment 
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total  tasks  executed 

25,579 

total  instructions  executed 

1,165,353 

average  instructions/task 

45.6 

average  instruction  run  length 

28.4 

average  operations  per  instruction 

4.1 

total  read- write  segments  created 

25,632 

average  read-write  segment  length 

6.0 

Table  4.3:  Set  Synchronization  Statistics 

The  set  simulation  experiment  was  simulated  with  100  tasks  executing  100  iterations 
of  simulated  work  task  (here,  the  task  is  counting  to  50).  Table  4.3  summarizes  the 
statistics. 

This  example  has  an  abnormally  high  number  of  instructions,  average  instructions 
per  task,  and  average  instruction  run  length  because  of  the  simulated  iteration  task:  a 
counting  loop  that  executes  *100  logical  instructions  without  suspension.  In  a  real  ap¬ 
plication.  the  task  would  probably  suspend  several  times  during  an  iteration  because 
of  communication  with  other  leaves.  If  an  application  task  could  run  unsuspended 
for  an  extended  period,  the  compiler  would  insert  voluntary  suspends  in  the  iteration 
sequence. 

In  this  experiment,  a  scalable  synchronization  tree  is  implemented  in  Pi.  The  example 
demonstrates  the  use  of  synchronization  trees.  Future  work  will  investigate  compiler 
techniques  to  reduce  set  synchronization  costs. 


4.4  Object  Name  Translation 

This  example  is  a  straightforward  implementation  of  a  distributed  name  resolution 
system.  It  uses  associative  segments  in  two  capacities:  as  map  storage  and  client 
caches.  Yet  these  usages  have  significantly  different  requirements.  Maps  require  safe 
storage  of  translations,  and  can  grow  to  a  relatively  large  size.  Client  caches  need 
not  be  safe,  since  the  translation  can  always  be  retrieved  from  the  appropriate  map. 
However,  since  clients  reside  on  nodes  with  the  application  that  uses  the  transla¬ 
tion  system,  they  must  have  fixed-sized  storage.  These  particular  requirements  are 
indicated  by  the  boundedness  and  safety  parameters  of  the  associative  segment. 

This  example  exploits  locality  using  the  Pi  operation  distance.  When  a  new  client 
is  created,  a  prototype  agent  is  initially  supph  Then  the  Closest-Agent  handler, 
shown  in  the  code  fragment  in  Figure  4.3,  is  sent  by  the  client  to  the  prototype  agent. 
This  handler  measures  the  distance  between  the  new  client  and  each  agent,  returning 
the  closest  agent. 

This  technique  results  from  a  compromise  made  in  Pi’s  design.  If  the  underling 
substrate  topology  was  visible  in  Pi,  this  technique  w'ould  not  be  necessary.  Agents 
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; ;  This  handler  finds  the  closest  agent  to  a  client.  Each  agent 
; ;  compares  its  distance  to  the  current  distance.  If  it  is  closer,  it 
; ;  updates  the  agent  node  and  segment  uith  it  oun  value  and  sets  the 
: ;  nee  distance.  When  the  last  agent  makes  the  comparison,  the 
;;  closest  agent  is  returned  to  the  client. 

(def ine-handler  Closest-Agent  (Count  This-Agent  Client-Mode  Client-Segment 

Distance  Agent-Mode  Agent-Segment)  () 
(branch-zero  (compare  (read  (self)  Distance)  -1)  Skip-A) 

(branch-not-minus  (compare  (distance  (read  (self)  Client-Node)  (node-id)) 

(read  (self)  Distance))  Skip-B) 

Skip-A 

(write  (self)  Distance  (distance  (read  (self)  Client-Mode)  (node-id))) 
(write  (self)  Agent-Node  (node-id)) 

(write  (self)  Agent-Segment  (read  (self)  This-Agent)) 

Skip-B 

(write  (self)  Count  (plus  (read  (self)  Count)  1)) 

(branch-zero  (compare  (read  (self)  Count)  MUMBER-OF- AGENTS)  Skip-C) 

(send  (read  (read  (self)  This-Agent)  AGEMT-MEXT-MODE) 

10  NORMAL  Closest-Agent  (read  (self)  Count) 

(read  (read  (self)  This-Agent)  AGEMT-MEXT-SEGMEMT) 

(read  (self)  Client-Node)  (read  (self)  Client-Segment) 

(read  (self)  Distance)  (read  (self)  Agent-Mode) 

(read  (self)  Agent-Segment) ) 

(branch-zero  0  Skip-D) 

Skip-C 

(send  (read  (self)  Client-Node)  6  NORMAL  Reply-Value 

(read  (self)  Agent-Node)  (read  (self)  Client-Segment) 
CLIENT-AGENT-NODE) 

(send  (read  (self)  Client-Mode)  6  NORMAL  Reply-Value 

(read  (self)  Agent-Segment)  (read  (self)  Client-Segment) 
CLIENT-AGENT-SEGMENT) 

Skip-D 

(destroy-segment  (self))) 


Figure  4..1:  Closest  Agent  Fragment 
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total  tasks  executed 

2,649 

total  instructions  executed 

14.049 

average  instructions/task 

5.3 

average  instruction  run  length 

4.5 

average  operations  per  instruction 

6.2 

total  read- write  segments  created 

2,786 

average  read- write  segment  length 

8.1 

Table  4.4:  Translation  Statistics 

could  be  distributed  uniformly  across  the  machine  at  known  locations,  so  a  new  agent 
could  directly  access  the  nearest  one.  But  this  would  commit  the  implementation  to 
a  specific  topology.  Using  the  more  general  technique  in  this  example,  the  closest 
agent  is  obtained  (with  some  computation  cost)  regardless  of  the  topology. 

The  cost  of  translation  is  computed  from  the  Match,  Agent-Match,  and  Match-In- 
Map  handlers  in  Appendix  A.  If  a  match  hits  in  the  client  translation  cache,  the  cost 
is  one  match  and  one  branch  operation.  If  the  translation  misses  in  the  cache,  the 
translation  must  be  fetched  from  the  appropriate  map  via  an  agent.  This  requires  two 
match,  one  branch,  three  send,  one  insert,  three  destroy- segment,  one  d-sync  attribute, 
one  d-sync  read,  and  one  mod  instructions. 

4.4.1  The  Experiment 

This  experiment  was  executed  with  ten  maps,  25  agents,  and  25  clients.  The  test 
program  injects  100  translations  into  the  system  via  different  clients.  Then  it  reads 
the  translations  backs  and  tests  the  results.  It  then  rereads  the  results  to  test  the 
client  cache. 

Table  4.4  summarizes  the  statistics.  This  is  the  smallest  example,  in  terms  of  instruc¬ 
tions.  It  also  sets  the  lower  bound  for  task  size  and  run  length.  A  Pi  substrate  must 
have  low  enough  task  management  overhead  to  support  an  execution  sequence  of  4.4 
logical  instructions. 

This  translation  system  demonstrates  one  straightforward  example  of  a  distributed 
service.  It  is  efficiently  implemented  (typically  with  two  logical  instructions,  13  in¬ 
structions  worst  case)  using  short  tasks  and  local  communication.  This  translation 
system  is  used  by  the  next  mechanism  example. 
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; ;  When  a  non-resident  handler  is  invoked,  this  handler  is  called  as 
; ;  the  normal  handler  type.  It  then  uses  the  first  argument  as  the 
; ;  non-resident  handler  type.  It  tests  if  the  handler  is  present  on 
;;  the  node.  If  not,  it  obtains  it.  Then  it  invokes  it  vith  the 
;;  remaining  arguments.  A  special  argument  segment  must  be  created. 

; ;  Since  the  locals  for  a  non-resident  handler  are  included  in  this 
; ;  space ,  care  must  be  taken  to  only  copy  parameters  into  it . 

(def ine-handler  Dispatch  (Name  : VARIABLE)  (Handler  Message-Length  Arguments 

Index- A  Index-B) 

(srite  (self)  Handler  (match  (read  (nodals)  Current-Heindlers) 

(read  (self)  Name))) 

(branch-not-zero  (compare  (read  (self)  Handler)  UNBOUND)  Handler-Knovn) 
(call  6  Lookup-Handler  (read  (self)  Name)) 

(write  (self)  Handler  (match  (read  (nodals)  Current -Haindlers) 

(read  (self)  Name))) 

Handler-Known 

(write  (self)  Message-Length  (minus  (read  (self)  LENGTH)  6)) 

(write  (self)  Arguments 

(create-read-write-segment  (minus  (read  (self)  LENGTH)  10))) 

(write  (self)  Index-A  9) 

(write  (self)  Index-B  0) 

Loop 

(branch-not-zero  (probe  (self)  (read  (self)  Index-A)  READ)  Skip) 

(write  (read  (self)  Arguments)  (read  (self)  Index-B) 

(read  (self)  (read  (self)  Index-A))) 

Skip 

(write  (self)  Index-A  (plus  (read  (self)  Index-A)  1)) 

(write  (self)  Index-B  (plus  (read  (self)  Index-B)  1)) 

(branch-minus  (compare  (read  (self)  Index-A)  (read  (self)  LENGTH))  Loop) 
(call-segment  (read  (self)  Message-Length)  (read  (self)  Handler) 

(read  (self)  Arguments)) 

(destroy-segment  (read  (self)  Arguments)) 

(destroy-segment  (self))) 


Figure  4.4:  Dispatch  Fragment 

4.5  Non-Resident  Handlers 

Non-Resident  handlers  employ  several  features  of  Pi  including  variable  argument  mes¬ 
sages  and  non-resident  handler  calling.  A  slightly  modified  version^  of  the  dispatch 
handler  is  shown  in  Figure  4.4. 

This  is  one  example  where  the  segment  parsed  to  the  call-segment  operation  can  be 
eliminated  by  the  machine  dependent  compiler.  A  selected  block  of  the  active  segment 
is  copied  to  a  new  segment  to  specify  that  a  variable  number  of  arguments  are  to  be 
passed.  The  compiler  can  figure  out  that  the  segment  is  created  only  for  use  by  the 

^PiSim  does  not  support  direct  dispatches  on  segments  (since  handlers  are  not  stored  as  seg¬ 
ments).  Therefore,  in  the  implemented  example,  the  code  segment  contains  the  name  of  the  handler 
being  executed.  See  Appendix  A  for  the  actual  code. 
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total  tasks  executed 

3,021 

total  instructions  executed 

23,915 

average  instructions/task 

7.9 

average  instruction  run  length 

7.1 

average  operations  per  instruction 

5.7 

total  read- write  segments  created 

3,683 

average  read-write  segment  length 

8.4 

Table  4.5:  Non-Resident  Handler  Statistics 

call-segment  operation.  Therefore,  it  can  call  the  message  directly  from  the  active 
segment,  avoiding  the  unnecessary  work. 

The  cost  of  dispatching  a  non-resident  handler  (which  is  present  in  the  node's  handler 
cache)  can  be  computed  from  the  code  of  the  dispatch  handler.  Besides  argument 
copying  (which  may  be  avoided  as  explained  above),  the  cost  is  one  match,  one  branch, 
one  call-segment,  and  one  destroy-segment  instructions. 

4.5.1  The  Experiment 

This  experiment  was  executed  using  a  test  non-resident  handler  that  prints  the  active 
segment  (the  printing  was  disabled  for  the  log  output  in  Appendix  B).  After  the 
handler  is  created  in  the  reference  map  and  added  to  the  translation  system,  it  is 
executed  twice  on  each  node  in  the  system.  The  first  trial  tests  the  handler  fetching 
mechanism.  The  second  trial  tests  the  handler  caching  mechanism. 

Table  4.5  summarizes  the  statistics  from  the  experiment.  Only  128  test  handlers 
are  dispatched.  Most  of  the  statistics  in  this  table  reflect  the  translation  system 
initialization.  In  the  trial,  a  times  instruction  was  used  in  place  of  the  print  statement 
to  shorten  the  (rather  dull)  log.  The  instruction  profile  in  Appendix  B  shows  that 
128  times  were  executed.  The  t«Lsk  type  profiles  shows  that  handler  lookup  system 
(e.g.,  match-agent)  was  only  executed  64  times.  Therefore,  the  local  handler  caches 
worked  properly. 

This  example  demonstrates  the  construction  of  system  code  in  Pi.  It  efficiently  sup¬ 
ports  (with  only  four  additional  logical  instructions)  a  dispatch  mechanism  that  differs 
from  the  resident  handler  dispatch  mechanism  provided  by  Pi. 

4.6  N-Body  Simulation 

In  this  example,  and  in  the  relaxation  example  in  the  next  section,  an  entire  applica¬ 
tion  is  coded  in  Pi.  They  employ  several  mechanisms  from  different  parallel  models 


Loop-A 

(srite  (sell)  AX  0) 

(vrite  (self)  AY  0) 

(adjust-count  (sell)  Barrier  (read  (sell)  Count)) 

(srite  (sell)  Index  0) 

Loop-B 

(send  (read  (read  (sell)  I-Sodes)  (read  (sell)  Index)) 

7  HORHAL  Update-Position 

(read  (sell)  X)  (read  (sell)  Y)  (read  (sell)  ID) 

(read  (read  (sell)  I-Segments)  (read  (sell)  Index))) 

(urite  (sell)  Index  (plus  (read  (sell)  Index)  1)) 

(branch-not-zero  (compare  (read  (sell)  Index)  (read  (sell)  Count))  Loop-B) 
(test-count  (sell)  Barrier) 

(write  (sell)  X  (plus  (read  (sell)  X)  (read  (sell)  VX))) 

(write  (sell)  Y  (plus  (read  (sell)  Y)  (read  (sell)  VY))) 

(write  (sell)  VX  (plus  (read  (sell)  VX)  (read  (sell)  AX))) 

(write  (sell)  VY  (plus  (read  (sell)  VY)  (reeul  (sell)  AY))) 

(write  (sell)  Tick  (plus  (read  (sell)  Tick)  1)) 

(branch-not-zero  (compare  (read  (sell)  Tick)  (read  (sell)  Last-Tick)) 
Loop-A) 


Figure  4.5:  Body  Inner  Loop  Fragment 

(specifically  dataflow  and  concurrent  object  oriented).  This  n-body  example  employs 
most  of  the  parallel  programming  features  of  Pi.  By  examining  the  code  that  imple¬ 
ments  the  application,  one  can  directly  observe  the  cost  of  the  implementation,  in 
terms  of  basic  mechanisms. 

For  example,  Figure  4.5  shows  the  inner  loop  of  the  body  object  handler.  If  there  are 
N  bodies,  each  body  executes  4-  1  adjust-count,  one  test-count,  —  1  send.  A'  -I-  5 
plus,  three  initialization,  N  -1-  1  branches,  and  A^  —  1  destroy-segment  instructions. 


4.6.1  The  Experiment 


total  tasks  executed 

213,051 

total  instructions  executed 

2,243,745 

average  instructions/task 

10.5 

average  instruction  run  length 

7.9 

average  operations  per  instruction 

6.3 

total  read-write  segments  created 

213,253 

average  read-write  segment  length 

7.1 

Table  4.6:  N-Body  Statistics  (100  Bodies,  10  Iterations) 
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total  tasks  executed 

180,156 

total  instructions  executed 

2,028,903 

average  instructions/task 

11.3 

average  instruction  run  length 

8.4 

average  operations  per  instruction 

6.3 

total  read- write  segments  created 

180,178 

average  read-write  segment  length 

6.5 

Table  4.7:  N-Body  Statistics  (10  Bodies,  1000  Iterations) 

Two  n-body  experiments  were  executed.  The  first  included  100  bodies  simulated 
for  ten  time  steps  (Table  4.6).  The  second  experiment  included  10  bodies  simulated 
for  1000  iterations  (Table  4.7).  The  operations  that  print  the  body  positions  were 
disabled  for  the  log  in  Appendix  B. 

Note  the  short  tasks  and  run  lengths.  The  task  lengths  are  short  because  of  the  high 
percentage  of  communication-only  tasks  (93%  of  the  100  body  trial,  99.9%  of  the  10 
body  trial).  The  run  length  (and  task  length)  of  the  communication-only  tasks  is  four 
logical  instructions.  The  run  length  of  body  and  interaction  tasks  is  longer,  but  still 
kept  short  by  synchronization  suspends. 

This  example  shows  the  low  overhead  of  supporting  an  application.  In  the  10  body 
example,  24.4%  of  the  logical  instruction  types  were  times,  used  in  the  interaction 
object  handler  to  compute  the  interaction  force. 

4.7  Relaxation 

This  application  is  implemented  using  only  four  handlers  types. 

Using  the  Pi  code  for  this  example,  the  “overhead”  of  supporting  the  algorithm  can  be 
computed.  The  cost  of  the  inner  loop  of  the  relaxation  (for  a  non-boundary  element) 
is  four  send,  four  match,  four  destroy-segment,  nine  branch,  one  initialize,  four  s-sync 
write,  four  s-sync  read,  four  plus,  and  one  divide  instructions. 

This  cost  could  be  further  reduced  by  handling  boundary  cases  specially  and  using 
a  more  sophisticated  initialization  scheme.  With  these  improvements,  the  inner  loop 
would  contain  ten  logical  instructions,  five  of  which  are  the  arithmetic  computation. 

4.7.1  The  Experiment 

In  this  experiment,  a  100  x  100  element  plate  was  simulated  for  100  time  steps.  Table 

4.8  summarizes  the  statistics.  These  results  are  similar  to  those  of  the  n-body  trials. 
These  larger  examples  tend  to  be  dominated  by  communication  messages. 
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total  tasks  executed 

36,201 

total  instructions  executed 

305,873 

average  instructions/task 

8.4 

average  instruction  run  length 

5.2 

average  operations  per  instruction 

5.3 

total  read-write  segments  created 

36,266 

average  read-write  segment  length 

7.0 

Table  4.8:  Relaxation  Statistics 


Relax 

1 

<  0.1% 

Reply- Value 

100 

0.3% 

Start- Element 

100 

0.3% 

Update- Temp 

36.000 

99.4% 

total 

36,201 

100.0% 

Table  4.9:  Relaxation  Task  Type  Profile 

Table  4.9  shows  the  task  type  profile  for  the  experiment.  Note  that  over  99%  of 
the  messages  are  the  communication/synchronization  message  type  Update- Temp. 
This  means  that  the  overhead  for  these  short,  ephemeral  tasks  play  a  key  role  in  the 
performance  of  this  application. 


4.8  Summary 

This  section  presents  composite  statistics  for  all  experiments  described  in  this  thesis. 


4.8.1  Instructions 

Table  4.10  lists  the  composite  dynamic  logical  instruction  profile  for  all  the  exper¬ 
iments.  Logical  instructions  that  occurred  less  than  .1  percent  are  excluded.  Note 
that  over  eighty  percent  of  the  executed  logical  instructions  are  basic  sequential  oper¬ 
ations.  This  shows  the  importance  of  efficient  sequential  support  for  this  small  set  of 
examples.  When  designing  a  parallel  architecture,  sequential  performance  must  not 
be  overlooked. 

Communication  (7.8%),  storage  management  (8.4%),  synchronization  (5.1%),  and 
associative  segment  (.9%)  instructions  have  a  significant  presence  in  the  profile.  If 
these  functions  are  not  efficiently  supported  in  the  machine  architecture,  overall  per¬ 
formance  could  be  jeopardized.  Parallel  machines  based  on  conventional  micropro- 
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instruction  type 


branch-not-zero 


times 


plus 


move 


destroy-segment 


send 


adjust-count 


exponent 


call 


branch-zero 


initialize 


and 


match 


test-count 


attribute 


a-shift 


return 


branch-minus 


divide 


number 

percentage 

1,129.039 

17.8% 

1,066.004 

16.8% 

916,048 

14.4% 

819,268 

12.9% 

552,919 

8.7% 

534.011 

8.4% 

492,619 

7.8% 

237.683 

3.7% 

94,500 

1.5% 

89,222 

1.4% 

76,567 

1.2% 

65,745 

1.0% 

55.334 

0.9% 

50,326 

0.8% 

48.120 

0.8% 

39,378 

0.6% 

29.144 

0.5% 

14,933 

0.2% 

11.160 

0.2% 

10,000 

0.2% 

8,730 

0.1% 

5,670 

0.1% 

6.355,758 

100.0% 

Table  4.10:  Dynamic  Instruction  Profile  Summary 
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operation  type 

number 

percentage 

self 

14,267,394 

39.4% 

read 

11.967,178 

33.1% 

3,540.240 

9.8% 

branch-not-zero 

1,129,039 

3.1% 

1,066.415 

2.9% 

plus 

2.5% 

minus 

701,767 

1.9% 

compare 

667,264 

1.8% 

destroy-segment 

534,011 

1.5% 

send 

488.793 

1.4% 

237,683 

0.7% 

exponent 

94,500 

0.3% 

call 

89,224 

0.2% 

branch-zero 

76.564 

0.2% 

nodals 

72.608 

0.2% 

and 

55.334 

0.2% 

match 

50,538 

0.1% 

test-count 

48.120 

0.1% 

attribute 

39,378 

0.1% 

mod 

35.449 

0.1% 

a-shift 

29,144 

0.1% 

25,372 

0.1% 

total 

36,181,081 

100.0% 

Table  4.11:  Dynamic  Operation  Profile  Summary 

ressors  are  vulnerable  unless  the  deficiencies  can  be  overcome  with  external  circuitry. 
A  Pi  substrate  must  efficiently  support  these  operations. 

4.8.2  Operations 

Table  4.11  lists  the  composite  operation  profile  for  all  the  experiments.  Operations 
that  occurred  less  than  .1  percent  are  excluded.  Since  references  to  the  active  segment 
are  explicit  in  Pi  programs,  the  leading  operation  is  self.  It  is  used  with  nearly  every 
logical  instruction.  The  machine  dependent  compiler  will  reduce  references  to  the 
active  segment  by  using  registers  for  temporaries.  Yet  the  expected  frequency  is  high 
enough  to  justify  special  handling  of  the  active  task  segment  in  a  Pi  substrate. 

In  Pi  programs,  all  read  and  write  operations  reference  a  segment.  While  the  number 
of  segment  accesses  would  also  be  reduced  by  register  usage  by  the  compiler,  the 
number  is  still  large  enough  to  require  efficient  support  for  offset  addressing  and 
bounds  checking. 
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size 

number 

percentage 

total  storage 

percentage 

1 

65 

<  .1% 

65 

<  .1% 

2 

64 

<  .1% 

128 

<  .1% 

3 

156 

<  .1% 

468 

<  .1% 

4 

16,567 

3.1% 

66,268 

1.8% 

5 

2.426 

0.5% 

12,130 

0.3% 

6 

228,666 

42.7% 

1,371,996 

37.1% 

7 

248,829 

46.5% 

1,741,803 

47.1% 

8 

16,136 

3.0% 

129,088 

3.5% 

9 

11,406 

2.1% 

102,654 

2.8% 

10 

2,570 

0.5% 

25,700 

0.7% 

11 

2 

<  .1% 

22 

<  .1% 

12 

639 

0.1% 

7,668 

0.2% 

13 

778 

0.1% 

10,114 

0.3% 

14 

154 

<  .1% 

2,156 

0.1% 

16 

1.624 

0.3% 

25,984 

0.7% 

19 

110 

<  .1% 

2.090 

0.1% 

21 

100 

<  .1% 

2,100 

0.1% 

23 

4.995 

0.9% 

114,885 

3.1% 

25 

22 

<  .1% 

550 

<  .1% 

99 

200 

<  .1% 

19,800 

0.5% 

100 

3 

<.1% 

300 

<  .1% 

1024 

64 

<  .1% 

65,536 

1.8% 

total 

535,576 

100.0% 

3,701,505 

100.0% 

Table  4.12:  Read- Write  Segment  Size  Profile 

The  exponent  operation  has  a  significant  presence  in  both  the  instruction  and  opera¬ 
tion  profiles.  As  explained  in  Chapter  3,  this  is  not  really  a  Pi  operation.  It  would  be 
supported  by  a  sequence  of  other  Pi  operations.  Since  this  sequence  would  contain 
mostly  arithmetic  instructions,  these  operations  are  underrepresented  in  the  profiles. 


4.8.3  Segments 

Table  4.12  lists  the  composite  read- write  segment  size  profile  for  all  the  experiments. 
The  average  segment  size  is  6.9.  Note  that  95.8%  of  the  segments  would  fit  in  an 
eight  w'ord  block.  99.0%  would  fit  in  a  sixteen  word  block. 

Another  interesting  parameter  is  segment  lifetime.  This  is  the  length  of  time  from 
when  a  segment  is  created  to  when  a  segment  is  destroyed.  Segment  lifetimes  are 
measured  in  Pi  operation  executions.  Table  4.13  shows  the  collected  data.  Note 
that  profile  bins  where  used  to  collect  the  data.  Each  entry  indicates  the  number  of 
segments  that  had  a  lifetime  falling  within  the  range  of  that  bin. 
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low 

high 

number 

9 

16 

46.519 

8.7 

17 

32 

162.046 

30.3 

33 

64 

34.585 

6.5 

65 

128 

59.312 

11.1 

129 

256 

45.513 

8.5 

257 

512 

33.103 

6.2 

513 

1,024 

15.813 

3.0 

1,025 

2,048 

40.428 

7.6 

2,049 

4,096 

50.181 

9.4 

4.097 

8.192 

34.112 

6.4 

8,193 

16,384 

6.496 

1.2 

16,385 

32,768 

28 

0.0 

32.769 

65,536 

176 

0.0 

65,537 

131,072 

132 

0.0 

131,073 

262,144 

60 

0.0 

262.145 

524,288 

5.051 

0.9 

524,289 

1.048.576 

454 

0.1 

1,048.577 

2,097.152 

2 

0.0 

total 

534.011 

100.0 

Table  4.13:  Segment  Age  Profile 


task  status 

number 

percentage 

NEW 

488.865 

63.9% 

WAITING 

231.992 

30.3% 

CALL 

44.740 

5.8% 

total 

765.597 

100.0% 

Table  4.14:  Task  Status  Profile 

The  average  segment  lifetime  is  5490  operations.  The  median  segment  lifetime  is  96 
operations.  This  discrepancy  is  cause  by  a  large  number  of  short-lived  segments  and 
a  small  number  of  very  long-lived  segments.  The  short-lived  segments  are  primarily 
communication-only  task  segments.  For  example,  in  the  relaxation  example.  Update- 
Temp  represented  99.4%  of  the  tasks  (and  task  segments)  executed  in  the  trial.  Yet 
this  task  typically  executed  only  four  logical  instructions  (20  Pi  operations). 

From  these  experiments,  the  typical  segment  can  be  characterized  as  small  ('7  words) 
and  ephemeral  ('100  Pi  operations).  A  Pi  substrate  must  provide  extremely  fast 
allocation  and  deallocation  for  small  segments.  Task  overhead  is  closely  dependent 
on  segment  management  efficiency. 


4.8.4  Tasks 

Table  4.14  lists  the  composite  task  status  profile  for  the  experiments.  NEW  is  the 
number  of  new  tasks  invoked  in  all  examples.  WAITING  is  the  number  of  task 
suspensions  and  resumptions.  CALL  is  the  number  of  called  tasks.  On  the  average, 
a  task  suspends  0.5  times  during  its  execution.  This  average  results  from  a  few  tasks 
that  suspend  several  times  and  many  tasks  that  never  suspend. 

During  the  experiments,  the  total  amount  of  time  each  task  spent  running  and  wait¬ 
ing  was  recorded.  The  running  time  includes  the  sum  of  all  time  when  the  task  is 
actively  executing  on  a  node.  The  waiting  time  includes  time  spent  in  the  message 
queue  waiting  to  execute,  as  well  as  waiting  time  between  suspensions.  Table  4.15 
presents  the  task  run  time  profile.  Note  that  93.5%  of  the  tasks  executed  under  32  Pi 
operations.  99.0%  executed  under  128  operations.  The  average  for  all  experiments  is 
74.3  operations. 

The  task  wait  time  profile  is  shown  in  Figure  4.16.  The  average  task  wait  time  is 
4980  Pi  operations.  The  median  task  wait  time  is  48  operations.  Note  that  38.8% 
of  the  tasks  executed  with  zero  wait  time.  These  tasks  (likely  communication-only) 
arrived,  executed  immediately,  and  completed  without  suspending. 

Task  wait  time  is  dependent  on  the  ratio  of  the  average  number  of  tasks  executing  on 
each  node.  In  these  experiments,  64  nodes  were  simulated.  Some  experiments,  like 
relaxation,  had  a  small  number  of  tasks  per  node  (typically  7.8  tasks/nodes),  while 
others,  like  100-body,  had  a  higher  number  ('150  tasks/node).  As  a  result,  the  wait 
time  because  of  long  message  queues  is  much  higher  for  the  100-body  experiment.  To 
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low 

high 

number 

percent 

9 

16 

62.070 

11.6 

17 

32 

437.094 

81.9 

33 

64 

23.290 

4.4 

65 

128 

5,590 

1.0 

129 

256 

150 

0.0 

513 

1.024 

3 

0.0 

1.025 

2.048 

5.016 

0.9 

2,049 

4,096 

5 

0.0 

4.097 

8,192 

24 

0.0 

8.193 

16.384 

105 

0.0 

16.385 

32,768 

100 

0.0 

32.769 

65.536 

100 

0.0 

39 

0.0 

131.073 

262.144 

7 

0.0 

262.145 

524.288 

12 

0.0 

total 

533.605 

iOO.O 

Table  4.15:  Task  Run  Time  Profile 

observe  the  this  effect,  the  lOO-body  experiment  was  executed  with  4096  simulated 
nodes.  The  median  task  wait  time  dropped  from  1536  Pi  operations  (on  64  nodes)  to 
one  Pi  operation  (on  4096  nodes). 

Table  4.17  shows  the  composite  statistics  for  all  the  experiments  described  in  this 
tnesis.  The  instructions/task  value  (11.9)  indicates  that  the  task  size  of  these  mech¬ 
anism  and  applications  is  small.  The  run  length  of  8.8  .logical  instructions  sugg-’sts 
that  the  lack  of  a  task  preemption  facility  in  Pi  does  not  have  a  major  effect  on  task 
waiting  time. 

In  these  examples,  it  is  possible  to  determine  the  exact  costs  of  mechanisms  in  terms  of 
parallel  operations.  Related  implementations  in  other  architectures  do  not  explicitly 
provide  this  measurement  because  the  mechanisms  are  inseparable  from  the  machine 
architecture.  The  ability  to  make  mechanism  comparisons  in  terms  of  basic  parallel 
operations  is  a  major  feature  of  Pi. 

While  these  examples  are  not  a  typical  cross  section  of  Pi  programs,  they  provide 
rough  data  for  the  specification  of  a  Pi  substrate.  The  characteristics  of  these  exam¬ 
ples  are  used  to  drive  the  design  of  a  Pi  substrate  in  the  p<^yt  chapter. 
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low 

high 

number 

percent 

0 

0 

206.830 

38.8 

1 

1 

531 

0.1 

2 

2 

2.856 

0.5 

3 

4 

1.297 

0.2 

5 

8 

6.788 

1.3 

9 

16 

7,167 

1.3 

17 

32 

11.414 

2.1 

33 

64 

36,609 

6.9 

65 

128 

40,446 

7.6 

129 

256 

43.064 

8.1 

257 

512 

24.686 

4.6 

513 

1.024 

15.880 

3.0 

1.025 

2,048 

40.453 

7.6 

2,049 

4,096 

9.3 

4,097 

8,192 

33.823 

6.3 

8.193 

16..384 

6.428 

1.2 

16.385 

32,768 

128 

0.0 

32.769 

65,536 

75 

0.0 

65,537 

131.072 

132 

0.0 

131,073 

262,144 

157 

0.0 

262,145 

524,288 

5,006 

0.9 

524.289 

1,048.576 

128 

0.0 

total  T  533.605  100.0 


Table  4.16;  Ta.sk  Wait  Time  Profile 


total  tasks  executed 

720.857 

total  instructions  executed 

t  ,355,758 

average  inslructions/task 

11.9 

average  instruction  run  length 

8.8 

average  operations  per  instruction 

5.7 

total  read- write  segments  created 

535.576 

average  read- write  segment  length 

6.9 

Table  4.17;  Composite  Example  Statistics 
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Chapter  5 

A  Pi  Substrate 


Chapter  3  dealt  with  issues  above  the  interface,  namely  the  construction  of  mech¬ 
anisms  for  different  computational  models.  This  chapter  considers  issues  below  the 
interface,  the  machine  substrate  that  supports  Pi. 

Pi  is  supported  by  hardware  in  combination  with  a  machine  dependent  compiler.  In 
this  chapter,  the  machine  architecture  is  examined.  The  machine  dependent  compiler 
is  considered  in  Chapter  4.  The  design  of  a  Pi  substrate  is  examined  only  &s  far 
as  necessary  to  demonstrate  that  it  can  be  built.  It  is  left  for  future  research  (and 
funding!)  to  complete  the  engineering  and  construction  of  the  substrate. 

This  chapter  includes  the  definition  of  a  machine  architecture,  PiMac,  plus  a  discus¬ 
sion  of  an  implementation.  To  avoid  confusion,  here  are  the  definitions  of  ’’machines” 
described  in  this  thesis. 

Pi  Abstract  Machine  This  is  the  architecture  of  the  Pi  interface.  A  machine  def¬ 
inition  is  used  as  a  medium  for  defining  the  interface.  However,  a  particular 
approach  to  supporting  the  interface  is  not  specified. 

PiMac  Machine  Architecture  This  a  specific  machine  architecture  design  to  sup¬ 
port  Pi.  Its  definition  includes  details,  such  as  microarchitecture  and  instruction 
set  architecture.  A  specific  hardware  implementation  is  not  specified. 

PiMac  1:  a  Gate  Array  PiMac  Implementation  This  is  a  design  for  an  imple¬ 
mentation  of  the  PiMac  machine  architecture.  It  uses  a  specific  technology  and 
incorporates  a  certain  set  of  implementation  goals  and  restrictions. 

This  chapter  begins  the  design  goals  of  a  Pi  substrate.  It  then  describes  the  PiMac 
machine  architecture.  Finally,  a  gate  array  implementation*  is  considered. 

‘For  a  more  detailed  discussion  of  the  differences  between  machine  architecture  and  machine 
implementation,  see  [47], 
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word  size 

48  bits 

maximum  storage  per  node 

65,536  words 

minimum  segment  size 

8  words 

maximum  segment  size 

4096  words 

task  limit  (per  node) 

255  ttisks 

segment  limit  (per  node) 

1,023  segments 

Table  5.1:  PiMac  Machine  Details 

5.1  Design  Goals 

Chapter  1  introduces  a  list  of  parallel  model  requirements  which  are  the  basis  of  the 
architectural  interface.  The  support  of  these  requirements  underlies  the  design  goals 
stated  here.  The  detailed  goals  driving  the  design  of  a  Pi  machine  include: 

•  low  latency  communication 

•  fast  task  switching  (suspensions,  queuing,  and  resumptions) 

•  multiple  task/message  queues 

•  support  for  attributes  (synchronization) 

•  efficient  storage  management 

•  support  for  read/ write  and  associative  segments 

•  fast  sequential  execution 

The  design  of  PiMac  aspires  to  one  additional  goal  that  is  not  required  by  Pi.  PiMac 
is  fine-grained;  both  in  terms  of  physical  nodes  size,  and  the  task  size  which  it  can 
efficiently  support.  .A.  coarse-grain  Pi  machine  is  also  feasible.  In  fact,  a  coarse-grain 
system  can  tolerate  higher  task  overhead,  since  fewer  tasks  are  executed  overall. 
Unfortunately,  coarse-grain  systems  have  less  potential  for  parallelism  as  well. 


5.2  PiMac:  A  Pi  Machine  Architecture 

PiMac  is  a  parallel  machine  designed  to  meet  the  requirements  of  the  Pi  architectural 
interface.  Table  5.1  summarizes  some  details  of  the  machine.  These  parameters  are 
selected  based  on  (a)  recent  results  from  fine-grain  message- passing  research  [27,  28] 
and  (b)  from  the  examples  in  Chapter  3.  This  machine  architecture  incorporates 
many  ideas  from  the  Message  Driven  Processor  [14.  15].  The  design  of  other  parallel 
machines  has  influenced  PiMac  [53.  7,  51,  31,  56.  4,  43,  8,  45,  13,  30]. 
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Tlie  memory  limit  is  small  compared  with  microprocessors  with  comparable  sequential 
performance.  The  Amdahl/ Case  rule  [25]  suggests  one  megabyte  of  memory  per  MIPS 
of  processor  performance.  By  this  rule,  a  PiMac  node  with  65  thousand  words  (256K 
bytes)  is  balanced  with  .25  MIPS  processor  implementation.  Yet  a  significantly  faster 
implementation  is  expected  ('25  MIPS). 

The  Amdahl/Case  rule  is  used  to  guarantee  high  processor  utilization.  It  evolved 
when  processor  hardware  was  expensive  relative  to  memory.  A  low  processor  utiliza¬ 
tion  (<  10%),  was  considered  excessively  wasteful.  Now,  processor  logic  consumes 
a  much  smaller  fraction  of  the  available  resources  (silicon  area).  Under  this  circum¬ 
stance,  excessive  concern  about  processor  utilization  is  inappropriate. 

The  Amdahl/Case  rule  also  suggests  one  megabit  of  I/O  bandwidth  per  MIP  of 
processor  performance.  For  a  25  MIPS  PiMac,  this  rule  would  expect  25  megabits 
per  second  of  I/O  bandwidth.  The  router  used  in  PiMac  provides  an  estimated  540 
megabits  per  second.  This  reflects  the  increased  requirement  for  communication  in  a 
fine-grain  multiprocessor. 

PiMac’s  memory  and  I/O  p  arameters  reflect  the  fine-grain  nature  of  the  machine. 
Results  by  Horwat  [2S]  suggest  that  these  parameters  are  adequate  for  fine-grain 
tasks.  This  choice  of  memory  size  is  also  supported  by  the  examples  in  Chapter  3, 
where  the  average  segment  size  is  "7  words.  However,  the  issue  of  small  memory  nodes 
is  an  open  research  question. 


5.2.1  Microarchitecture 

The  PiMac  microarchitecture  is  illustrated  in  Figure  5.1.  It  is  composed  of  six  mod¬ 
ules. 

Router 

PiMac  nodes  are  connected  in  a  three  dimensional  mesh  network.  The  network  trans¬ 
ports  messages  as  chains  of  18  bit  (17,  54]  which  are  directed  using  a  non-adaptive 
routing  algorithm.  The  network  employs  wormhole  routing  to  reduce  latency.  The 
router  module  is  one  node’s  contribution  to  the  composite  system  network.  Unless 
a  message  is  being  sent  or  received  by  a  node,  the  message  router  operates  indepen¬ 
dently  from  the  rest  of  the  node.  The  specifications  of  the  router  module  are  taken 
directly  from  the  MDP  [18,  54j. 


Network  Controller 

The  network  controller  transfers  messages  from  the  network  to  local  memory.  It  also 
inserts  messages  from  the  datapath  into  the  network.  When  a  message  arrives  at 

•A  flit  is  the  smallest  piece  of  data  that  is  arbitrated  in  the  network. 
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controller 


external 

memory 


Figure  5.1:  PiMac  Microarchitecture 
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a  node,  the  network  controller  allocates  an  appropriate  sized  storage  block  in  local 
memory.  The  controller  then  receives  and  stores  the  message  in  memory.  Finally,  it 
adds  the  message  to  a  task  queue  for  later  execution. 


Datapath 

The  datapath  module  is  composed  of  two  separate  datapaths.  The  first  is  the  tradi¬ 
tional  integer  datapath  containing  a  few  data  registers  and  a  general  purpose  ALU. 
This  datapath  is  designed  to  provide  fast  context  switching,  and  efficient  support  for 
several  Pi  operations. 

The  second  datapath  supports  the  specialized  requirements  of  addresses.  This  data¬ 
path  performs  operations  to  support  segment  addressing,  including  bounds  checking 
and  offset  accessing. 

Segment  Buffers 

These  buffers  are  positioned  between  the  datapath  and  the  memory.  They  act  as  a 
cache  for  recently  accessed  memory  segments.  They  also  provide  hardware  support  for 
Pi  synchronizations.  One  segment  buffer  is  also  used  for  buffering  incoming  messages 
received  from  the  network  controller. 


Memory  Controller 

The  memory  controller  supports  memory  transfers  between  the  buffers  and  internal 
and/or  external  memory.  In  some  PiMac  implementations,  memory  caches  are  also 
provided  in  this  module. 


Instruction  Unit 

This  module  fetches  and  decodes  instructions  for  execution.  It  executes  in  paral¬ 
lel  with  the  datapath  to  provide  one  instruction  per  clock  for  most  non-branching 
instructions.  A  special  segment  buffer  caches  several  instructions  from  memory. 


Controller 

This  module  coordinates  most  of  the  activity  in  the  microarchitecture.  An  excep¬ 
tion  is  the  network  controller  and  router  which  operate  autonomously.  The  network 
controller  bridges  the  two  “machines”. 
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Figure  5.2:  A  Segment  Descriptor 


5.2.2  Segments 

The  node  memory  space  is  addressed  in  sixteen  bits.  PiMac  allows  linear  addressing 
of  memory.  However,  the  expected  mode  of  node  memory  access  is  via  segments.  A 
segment  can  reside  anywhere  in  the  address  space,  but  must  be  aligned  on  an  eight 
word  boundary.  An  individual  segment  can  be  up  to  4096  words  long.  Segment  sizes 
are  integer  multiples  of  eight  words.  Therefore  a  segment  base  pointer  is  16  -  3  =  13 
bits  long  and  a  segment  size  is  12  -  3  =  9  bits.  A  segment  pointer  includes  both  the 
base  and  size  specifiers,  so  segment  accesses  are  bounds  checked.  This  is  illustrated 
in  Figure  5.2. 

When  a  segment  size  field  is  zero,  bounds  checking  is  disabled.  A  zeroed  segment 
pointer  is  used  for  non-segment  (linear)  addressing  of  memory.  The  offset  into  this 
segment  is  an  absolute  memory  address. 

To  minimize  segment  allocation  time,  a  list  of  free  segments  is  maintained  by  a  free 
segment  list  which  contains  a  linked  list  of  fixed  sized^  free  segments. 

Usually,  the  processor  or  network  controller  can  quickly  access  an  adequate  sized 
segment  from  this  list.  If  the  length  field  is  larger  than  the  fixed  size  of  segments  on 
the  free  list,  or  if  the  free  list  is  empty,  a  trap  is  generated,  causing  the  node  processor 
to  execute  a  special  handler  to  allocate  the  required  segment  or  segments. 

Each  segment  is  assigned  a  ten  bit  segment  ID  which  provides  a  logical  name  for  it 
on  the  node.  This  name  is  provided  externally  to  identify  the  segment.  A  segment 
map  maintains  the  base  and  size  specifiers  for  each  segment. 

•^The  optimal  size  will  be  determined  in  future  parallel  operating  system  experiments.  Based  on 
the  examples  in  Chapter  3.  an  expected  size  is  24. 
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Figure  5.3:  Indexing  into  the  Segment  Map 

Segment  ID  allocation  is  usually  combined  with  storage  allocation.  VV'hen  a  segment’s 
storage  is  freed,  the  segment  ID  is  reclaimed  with  it.  The  segment  storage  and  ID  are 
placed  on  the  free  segment  list  together.  When  a  segment  is  allocated  from  the  free 
list,  it  is  assigned  the  same  segment  ID.  Since  most  segments  are  short-lived,  this  is 
the  usual  technique  for  segment  ID  management. 

Two  cases  require  more  complicated  segment  ID  allocation  technique.  If  the  free 
segment  list  is  empty,  or  if  the  fixed  segment  size  of  the  free  list  is  not  large  enough 
to  accommodate  the  required  storage.  In  these  cases,  the  segment  map  is  scanned  for 
unused  segment  IDs. 

A  complete  segment  descriptor  includes  the  segment  ID,  base  pointer,  and  size  de¬ 
scriptor.  If  segments  are  relocated  (e.g.,  during  heap  compaction),  the  segment  bases 
of  all  segment  descriptors  are  invalidated.  When  the  segment  descriptor  is  referenced 
again,  the  segment  ID  is  used  to  access  the  segment  map  to  obtain  the  segment  base 
and  size  specifiers.  This  process  is  illustrated  in  Figure  5.3. 

5.2.3  Word  Format 

A  named  word  in  PiMac  is  composed  of  five  fields: 


Type (3) 


.Access  Pattern  (3) 


Access  Locks  (2) 


Suspended  Tasks  (8) 


Value  (32) 


The  Type,  Access  Pattern,  Access  Locks,  and  Suspended  Tasks  fields  are  collectively 
called  the  tag  of  the  value. 
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type 

description 

format 

0 

integer 

two’s  compliment  value 

1 

symbol 

value  (0  =  NIL) 

2 

boolean 

one  bit  boolean 

3 

segment  descriptor 

(segment  number,  base,  size) 

4 

instruction 

instruction  value 

5 

user  defined  type  A 

b 

user  defined  type  B 

7 

user  defined  type  C 

Table  5.2;  Word  Types 


code 

name 

description 

0 

D-Sync 

d-sync  access  pattern 

1 

S-Sync 

s-sync  access  pattern 

2 

B-Sync 

b-sync  access  pattern 

3 

VVTite  Onec 

write  once  access  pattern 

4 

Read  Only 

read  only  access  pattern 

5 

Write  Only 

write  only  access  pattern 

Table  5.3:  Access  Pattern  Types 


Type 

This  field  determines  the  type  of  data  stored  in  the  value  field.  The  types  are  defined 
in  Table  5.2.  .Arithmetic  operations  on  user  defined  types  result  in  user  trap  handlers. 
User  defined  types  are  used  to  support  special  types  such  as  floating  point  numbers 
and  complex  numbers. 


Access  Pattern 

This  field  defines  the  access  pattern  for  this  word  of  memory.  Used  with  the  access 
lock  bits,  it  determines  what  accesses  are  permissible,  and  how  access  locks  bits  should 
be  changed  by  an  access.  The  access  patterns  are  defined  in  Table  5.3. 

Access  Locks 

Each  named  word  include  two  access  locks:  read  and  write.  On  each  access  to  the 
word,  the  corresponding  access  bit  is  tested.  If  the  access  is  locked,  the  operation  is 
canceled  and  a  microsequence  is  initiated  which  suspends  the  current  task. 
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If  an  access  is  not  locked,  the  access  completes,  and  the  lock  bits  are  modified  in 
accordance  with  access  pattern  defined  in  the  access  pattern  field.  For  some  access 
patterns,  a  particular  access  results  in  the  requeuing  of  tasks  contained  in  the  sus¬ 
pended  task  list. 


Suspended  Tasks 

This  field  points  to  the  head  of  a  list  of  tasks  suspended  on  this  word.  If  an  access 
of  this  word  causes  the  current  task  to  be  suspended,  it  is  pushed  on  the  head  of 
this  list.  If  an  access  of  this  word  causes  requeuing  of  the  suspended  tasks,  each  task 
is  added  to  the  appropriate  task  queue.  Tasks  in  the  suspended  task  list  are  linked 
together  by  task  links.  A  task’s  link  is  maintained  by  the  suspended  tasks  field  of 
first  word  of  the  task  segment. 


Value 

The  remaining  field  of  a  word  is  the  value  field.  It  contains  a  32  bit  value. 


5.2.4  Register  Architecture 

The  PiMac  register  architecture  is  illustrated  in  Figure  5.4.  This  section  describes 
each  component. 


Active  Segment 

.A.t  any  given  moment,  a  PiMac  node  is  either  executing  a  task  or  waiting  for  a 
message  (idle).  If  a  task  is  executing,  the  active  segment  is  defined  in  the  active 
segment  register.  This  register  includes  a  pointer  to  the  physical  location  of  the 
segment  in  local  memory.  This  is  used  in  several  addressing  modes  of  the  instruction 
set  architecture.  It  also  maintains  the  task  number,  a  logical  name  of  the  task  used 
in  both  task  management  and  synchronization  support  in  PiMac.  The  value  of  the 
active  segment  register  is  set  automatically  when  a  task  begins  execution. 


Code  Segment 

Each  task  segment  includes  a  reference  to  a  segment  which  contains  the  code  for  the 
handler  being  executed.  This  handler  is  also  set  when  a  task  begins  execution. 


Instruction  Pointer 

This  register  indexes  the  current  instruction  in  the  code  segment.  It  is  analogous  to 
a  program  counter  in  a  microprocessor. 
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Figure  5.4;  PiMac  Register  Architecture 
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General  Registers 

Six  general  purpose  registers  are  provided  in  the  register  architecture.  They  are  used 
to  supplement  storage  in  the  active  segment  and  reduce  sw.^-.pping  of  the  segment 
buffers. 


Segment  Buffers 

Four  segment  buffers  provide  dedicated  caches  for  node  memory.  Each  buffer  contains 
a  quad  word'*  of  a  segment.  Each  buffer  maintains  a  key  referencing  its  physical  address 
in  memory.  Each  buffer  also  includes  a  dirty  bit  which  indicates  whether  the  buffer 
must  be  written  back  to  memory  when  it  is  replaced. 

One  buffer  is  dedicated  for  the  code  segment  of  the  active  task.  Another  always 
includes  a  quadword  of  the  active  segment.  The  third  buffer  can  include  a  quadword 
from  any  segment  addressed  by  a  segment-accessing  instruction.  The  last  buffer  is 
used  to  buffer  incoming  messages  to  the  node.  Outgoing  messages  are  sent  directly, 
without  buffering. 

Task/Message  Queues 

These  registers  maintain  the  head  and  tail  of  four  task  queues  (one  for  each  prior¬ 
ity).  When  a  message  is  received  from  the  network,  it  is  enqueued  at  the  tail  of  the 
appropriate  task/message  queue  by  the  network  controller.  When  a  task  is  requeued 
after  a  suspension,  it  is  inserted  at  the  tail  of  a  priority  queue.  The  particular  queue 
is  determined  by  the  priority  field  of  the  task. 

These  registers  are  shared  by  the  network  controller  and  the  processor.  Accesses  to 
them,  and  node  memory,  are  arbitrated  by  hardware  to  guarantee  mutual  exclusion. 


Trap  Task  Register 

This  register  maintains  the  ID  of  the  task  which  was  executing  when  the  trap  occurred. 
Since  traps  are  disabled  during  trap  handling,  this  register  is  never  inadvertently 
overwritten. 


Free  Segments 

This  register  maintains  the  head  of  the  free  segment  list.  It  also  contains  the  fixed 
segment  size  contained  in  the  list.  The  fixed  size  is  a  multiple  of  eight  words,  so  the 
five  bits  of  the  size  field  are  the  most  significant  of  a  eight  bit  size  field. 

'*Quadwords  are  aligned  on  four  word  boundaries. 
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state 

size 

saved  on  switch? 

resumed? 

active  segment  pointer 

one  word 

no 

no^ 

code  segment  pointer 

one  word 

no 

no^ 

instruction  pointer 

12  bits 

yes 

yes 

general  registers 

six  words 

yes 

yes 

instruction  buffer 

quadword 

no 

yes 

active  segment  buffer 

quadword 

sometimes 

sometimes 

other  segment  buffer 

quadword 

sometimes 

sometimes 

message  input  buffer 

quadword 

not  affected 

not  affected 

task  queues 

two  words 

not  affected 

not  affected 

free  segments  pointers 

15  bits 

not  affected 

not  affected 

clock 

two  words 

not  affected 

not  affected 

node  ID 

20  bits 

not  affected 

not  affected 

trap  task  pointer 

eight  bits 

not  affected 

not  affected 

total 

.30  words 

8-1-8  words 

12  -f  8  words 

Table  5.4;  Register  Architecture  Suspend/ Resume  State 


Clock 

When  a  PiMac  machine  is  initialized,  a  64  bit  free  running  clock  on  each  node  is 
zeroed.  This  clock  can  be  accessed  to  provide  synchronized  timestamps  on  different 
nodes.  It  is  also  used  to  support  the  Pi  operation  time.  At  50  MHz,  this  clock  wraps 
around  every  eleven  thousand  years. 


Node  ID 

This  register  is  set  by  software  during  initialization.  It  supports  the  node-id  operation 
in  Pi. 


Segment  Map 

Each  segment  on  a  node  is  assigned  a  unique  name  or  segment  ID.  The  node  maintains 
a  segment  map  which  maps  these  numbers  to  the  corresponding  segment  in  memory. 
This  table  is  located  in  local  memory  at  address  0000— 03FF.  It  can  be  accessed  in  a 
microsequence  in  one  memory  cycle. 

^These  values  must  be  restored  from  the  segment  map. 
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5.2.5  Instructicn  Set  Architecture 

The  instruction  set  for  PiMac  contains  the  typical  collection  of  sequential  instruc¬ 
tions.  plus  several  additional  instructions  for  efficient  Pi  support.  The  instruction  set 
architecture  is  composed  of  several  instruction  types,  each  supporting  one  or  more 
addressing  modes. 

Addressing  Modes 

PiMac  supports  register  to  register  addressing  modes.  In  addition,  it  provides  special 
modes  for  segment  addressing.  The  eight  addressing  modes  in  the  instruction  set 
architecture  are: 


mode  0 


register  <—  register  op  (segment  -1  variable  offset) 


This  addressing  mode  accesses  one  operand  from  a  register.  The  second  operand  is 
accessed  from  a  segment.  The  segment  is  specified  by  a  register  containing  a  segment 
pointer.  The  offset  is  specih'  by  a  register  containing  an  integer.  Both  registers  are 
specified  by  the  instruction. 


mode  1 


register  register  op  (segment  -f  constant  offset) 


This  addressing  mode  accesses  one  operand  from  a  register.  The  second  operand  is 
accessed  from  a  segment.  The  segment  is  specified  by  a  register  containing  a  segment 
pointer.  The  offset  is  specified  by  a  14  bit  constant  stored  in  the  instruction. 


mode  2 


register  <—  register  op  register 
In  this  addressing  mode,  both  operands  and  the  destination  are  specified  by  registers. 


mode  3 


register  <—  register  op  constant 


This  addressing  mode  accesses  one  operand  from  a  register.  The  second  operand  is 
a  16  bit  sign  extended  constant  stored  in  the  instruction.  The  result  is  stored  in  a 
register. 


mode  4 


(segment  -f  variable  offset)  ♦—  register 


This  addressing  mode  is  for  storing  a  value  from  a  register  to  a  segment.  As  in  mode 
0,  The  segment  slot  is  specified  by  two  registers,  a  segment  pointer  and  an  integer 
offset. 


mode  5 


(segment  +  constant  offset)  <—  register 


This  addressing  mode  is  for  storing  a  value  from  a  register  to  a  s*‘grnpnt  As  in  mode 
3,  The  segment  slot  is  specified  by  a  register  containing  a  segment  pointer  and  a  1 1 
bit  offset  constant. 


mode  6 


register  special  register 
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code 

name 

description 

0 

Rz 

constant  (zero) 

1 

Rs 

active  segment  pointer 

2 

Ra 

register  A 

3 

Rb 

register  B 

4 

Rc 

register  C 

5 

Rd 

register  D 

6 

Re 

register  E 

7 

Rf 

register  F 

Table  5.5:  Instruction  Register  Codes 

This  addressing  mode  is  for  loading  a  value  from  a  special  register. 

special  register  <—  register 

This  addressing  mode  is  for  storing  a  special  register. 

.\s  mentioned  earlier,  a  zero  segment  size  descriptor  disables  segment  bounds  checking. 
This  provides  a  mechanism  for  non-segment  memory  access  using  a  segment  pointer 
with  a  zero  base  and  size  descriptor.  In  mode  0  and  4,  the  offset  register  can  contain  up 
to  a  16  bit  offset.  This  allows  the  entire  local  memory  to  be  treated  cis  a  contiguously 
addressable  block.  This  feature  is  important  for  system  routines  like  heap  compaction. 

Table  5.5  lists  the  codes  for  register  fields  of  an  instruction.  Besides  the  six  general 
registers,  it  includes  two  special  references.  Code  zero  accesses  a  pseudo- register 
RO  which  always  contains  a  zero.  It  can  be  written  (unsuccessfully)  to  other  values 
without  an  error.  It  is  useful  for  supporting  operations  like  move  (add  Ri,  Rj,  RO), 
and  nop  (and  RO,  RO,  RO).  It  is  also  useful  as  the  base  register  for  non-segment 
(absolute)  addressing.  Since  the  segment  number,  base,  and  size  fields  are  all  zero, 
it  can  support  linear  addressing  in  address  modes  0  and  4.  This  special  register  does 
not  cause  type  traps.  The  second  special  reference,  Rs  references  the  active  segment. 
This  pseudo-register  is  set  automatically,  and  cannot  be  modified.  It  is  provided  for 
low  overhead  access  to  the  active  segment. 

Table  5.6  lists  the  codes  for  the  special  registers  accessed  in  modes  six  and  seven. 
When  accessing  the  four  priority  queues  and  the  free  segment  list,  the  tail  pointer  is 
positioned  at  word  bits  6-15.  The  head  pointer  is  positioned  at  word  bits  22-31. 

Figure  5.5  presents  the  instruction  format  of  each  addressing  mode.  The  address 
mode  decoding  is  determined  by  the  mode  bits.  The  first  mode  bit  (bit  29)  indicates 
whether  a  constant  or  register-1  is  being  accessed  from  the  instruction.  The  second 
mode  bit  (bit  30)  determines  whether  the  operation  accesses  a  segment  (as  specified 
in  registers  k,  1,  or  a  constant).  The  last  address  mode  bit  (bit  31)  (in  combination 
with  bit  30)  indicates  whether  an  ALU  operation  is  involved,  and  determines  the 
direction  of  a  segment  access.  It  also  determines  whether  general  or  special  registers 
are  being  accessed. 


mode  7 
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mode  0  register  i  <-  register  j  <operation>  register-k  [register-1] 

31  29  28  23  22  2019  1716  1413  11  10  0 


0  0  0 

type 

reg.  i 

regj 

reg.  k 

reg.  1 

00000000000 

mode  1 

31  29  28 

register  i  <-  register]  <operation>  register-k  [constant] 

23  22  2019  1716  14  13  0 

0  0  1 

type 

reg.  i 

reg.] 

reg.  k 

constant  (14  bit  unsigned) 

mode  2 

31  29  28 

register  i  <-  register  j  <operation>  register-1 

23  22  20  19  17  16  14  13  11  10  0 

0  1  0 

type 

reg.  i 

reg.] 

000 

reg.  1 

00000000000 

mode  3 

31  29  28 

register  i  <-  register]  <operation>  constant 

23  22  2019  17  16  15  0 

0  1  1 

type 

reg.  i 

reg.j 

E 

constant  (16  bit  sign  extended) 

mode  4 

31  29  28 

23  22  2019  1716 

register-k  [register-1]  <-  register-] 

14  13  1110  0 

1  0  0 

type 

0  0  0 

reg.j 

reg.  k 

reg.  1 

00000000000 

mode  5 

31  29  28 

23  22  2019  1716 

register-k  [constant]  <-  register-] 

14  13  0 

1  0  1 

type 

0  0  0 

reg.j 

reg.  k 

constant  (14  bit  unsigned) 

mode  6 

31  29  28 

register  i  <-  special  register  ]  <operation>  register-1 

23  22  2019  1716  1413  11  10  0 

1  1  0 

type 

reg.  i 

reg.j 

000 

reg.  1 

00000000000 

mode  7 

31  29  28 

special  register  i  <-  register  ]  <operation>  constant 

23  22  20  19  17  16  15  0 

1  1  1 

type 

reg.  i 

reg.] 

Q 

constant  (16  bit  sign  extended) 

Figure  5.5:  PiMac  Instruction  Formats 
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code 

description 

0 

clock  (low  32  bits) 

1 

clock  (high  32  bits) 

2 

node  ID 

3 

task  queue  0 

4 

task  queue  1 

5 

task  queue  2 

6 

task  queue  3 

7 

free  segments 

Table  5.6:  Special  Register  Codes 


class 

instruction  types 

arithmetic 

Add,  Subtract,  Multiply 

logical 

And,  Or,  Xor 

Arithmetic-Shift,  Logical-Shift,  Rotate 

comparison 

Compare 

memory  store 

Store 

tag  support 

Read-Tag,  Write-Tag 

sync  support 

Read-Count 

hash  support 

Hash,  Match 

branching 

Branch-Plus,  Branch-Plus-Zero,  Branch-Zero, 
Branch-Not-Zero,  Branch-Minus,  Branch-Minus-Zero 

task  support 

Resume,  Send,  Call,  Suspend,  End 

Table  5.7:  Instruction  Types 

These  instruction  formats  could  be  packed  more  tightly,  increasing  the  code  density. 
Perhaps  two  instructions  could  be  packed  into  a  single  word.  However,  this  would 
complicate  instruction  decoding,  and  cloud  the  clarity  of  the  machine  architecture. 
PiMac  is  intentionally  simple. 


Instruction  Types 

The  PiMac  instruction  set  architecture  includes  27  instruction  types.  A  type  defines 
the  operation  performed  by  an  instruction.  The  instruction  types  are  summarized  in 
Table  5.7. 


Add 


Subtract 


C  =  A  +  B 
C  =  A-  B 


P6 


Multiply 


C  =  A  *  B 


These  instruction  types  perform  two’s  complement  arithmetic  operations  on  integers, 
producing  an  integer  result.  These  operations  can  produce  overflow/underflow  traps. 


And 


Or 


C  =  A  A  B 
C  =  A  V  B 
C  =  A  ©  B 

These  instruction  types  perform  the  logical  operations  on  boolean  and  integer  vadues. 
The  logical  operation  Not  is  achieved  as  (Value  Xor  -1). 


Xor 


Arithmetic-Shift 


left 

"I  I  I  I  I  I  I 


Logical-Shift 


Rotate 


Ol  M^l  I  I  V— 

-14  I  iffl  I  I  V-*-  ^ 

pH  1  liyi  1 1  h-]  f-rri  I  I  I  I  I  1^ 


left 

TTr'l  I  I 


•0 

•0 


left 


These  instruction  types  shift  or  rotate  an  integer  by  a  specified  number  of  binary 
digits.  Arithmetic  shifts  preserve  the  sign  of  the  integer.  Logical  shifts  insert  zeros. 
Rotate  wraps  around  the  shifted  out  bit.  B  can  be  positive  or  negative.  Arithmetic 
shifts  can  cause  overflow/underflow  traps. 


C  =  compare  (A,  B) 

This  instruction  type  produces  an  integer  comparison  of  two  values.  The  values  can 
be  of  any  type,  but  they  must  be  the  same.  If  the  compared  values  are  integers  or 
boolean,  the  comparison  result  is  1  if  X  >  T,  0  if  X  =  T  and  -1  if  X"  <5^.  For  all 
other  types,  the  comparison  performs  an  equivalency  test,  zero  if  X  =  T,  non-zero  if 
^  y.  The  result  is  always  an  integer. 


Compare 


This  instruction  type  stores  a  value  in  a  segment.  The  value  can  be  any  type. 


Read- Tag 


C  =  tag  (A) 


tag  (C)  =  A 

These  instruction  types  access  the  16  bit  tag  of  a  segment  value  or  the  three  bit  tag 
of  a  register.  When  a  tag  is  read,  it  is  copied  to  the  least  significant  16  bits  of  the 
register  in  the  format  of  a  segment  tag.  The  format  is  the  same  as  the  tag  format 
specified  in  section  5.2.3.  Tags  are  written  using  the  same  format.  Register  tags 
are  copied  to  the  same  position  in  the  lower  16  bits.  All  other  tag  fields  are  zeroed. 
Copied  tags  are  typed  as  integers. 


Write- Tag 


Read-Count 


C  =  A 
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This  instruction  reads  the  value  of  a  word.  It  bypasses  the  read  lock  mechanism  so 
it  never  causes  a  task  suspension.  It  is  used  to  support  the  adjust-count  operation  in 
Pi. 

Hash  C  =  hash  (key,  size) 

This  instruction  type  computes  a  hash  index  using  the  specified  key  and  size.  The 
function  is  computed  using  specialized  hardware.  No  specific  hash  function  is  guar¬ 
anteed.  The  value,  key,  and  result  are  all  integers.  This  hash  function  is  identical  on 
all  nodes. 


This  instruction  type  performs  four  simultaneous  comparisons  to  quadword  aligned 
values  in  a  segment.  If  a  value  matches,  the  segment  offset  of  the  value  is  returned. 
If  no  match  is  found,  -1  is  returned.  The  key  is  an  integer;  the  segment  quadword  is 
specified  as  a  segment  pointer,  offset  pair.  The  resulting  offset  is  an  integer. 

if  (A  >  0),  IP  <—  IP  -f-  branch  offset 

if  (A  >  0),  IP  <—  IP  -f  branch  offset 

if  (A  =  0),  IP  4—  IP  -I-  branch  offset 

if  (A  7^  0),  IP  4—  IP  -}-  branch  offset 

if  (A  <  0),  IP  4—  IP  -f  branch  offset 

if  (A  <  0),  IP  4—  IP  -I-  branch  offset 

These  instruction  types  perform  a  conditional,  relative  branch  if  the  specified  test  is 
true.  The  test  variable  and  the  offset  are  integers. 

This  instruction  type  forces  a  suspension  of  the  active  task,  and  a  resumption  of  the 
task  segment  specified  by  the  instruction.  The  segment  specifier  must  be  a  segment 
pointer. 

Send  network  controller  4—  A 

This  instruction  type  gives  one  word  to  the  network  controller  to  be  sent  out  on 
the  network.  A  task  executing  the  send  instruction  waits  until  the  network  controller 
acknowledges  that  the  word  has  been  sent.  For  simplicity,  a  two  word  send  instruction 
is  not  included  here. 

Call  call  (task  A) 

This  instruction  calls  a  task  specified  by  a  segment  ID.  First  the  current  task  is 
suspended  and  inserted  on  the  head  of  the  highest  priority  task  queue.  Then  the 
specified  task  ID  is  invoked,  as  if  it  had  just  been  dequeued. 


Resume 


resume  (task  A) 


Match 


C  =  match  (key,  segment  quadword) 
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suspend  (active  task) 

This  instruction  type  forces  a  suspension  of  the  active  task.  The  dequeue  microse¬ 
quence  is  then  executed. 


Suspend 


End 


end  (active  task) 

This  instruction  type  ends  the  current  task  execution  thread.  The  dequeue  microse¬ 
quence  is  executed. 


Table  5.8  illustrates  the  addressing  mode/instruction  type  combinations  supported  in 
PiMac. 


5.2.6  Network 

PiMac  nodes  are  connected  via  a  three  dimensional  mesh  network.  Each  node  con¬ 
tains  a  piece  of  the  network  (a  router),  plus  a  network  controller.  The  router  and 
interconnect  topology  is  incorporated  directly  from  the  J-machine  [16,  20,  18,  54]. 
The  network  controller  has  several  distinguishing  characteristics. 

Like  the  controller  in  the  J-Machine,  the  PiMac  network  controller  sits  on  the  edges  of 
the  network  mesh  and  enqueues/dequeues  messages  for  a  specific  node.  However,  the 
PiMac  controller  has  no  output  buffering.  If  a  message  being  sent  into  the  network 
blocks,  the  task  executing  on  the  node  stalls  until  the  network  clears.  A  message 
sent  to  and  from  the  same  node  bypasses  the  network  controller,  so  a  trivial  deadlock 
situation  is  avoided. 

The  network  controller  buffers  an  incoming  message  in  the  input  buffer.  Once  the 
buffer  is  filled,  the  incoming  message  is  stalled  in  the  network  while  the  network 
controller  arbitrates  for  access  to  node  memory  to  write  the  buffer.  When  the  network 
controller  receives  the  first  word  of  the  message,  the  length,  it  blocks  the  message  until 
it  can  obtain  a  segment  equal  to  or  exceeding  the  storage  requirement.  It  then  absorbs 
the  message  from  the  network  and  writes  it  to  the  segment  four  words  at  a  time.  A 
segment  is  allocated  from  the  free  segment  list. 

.After  the  network  controller  has  stored  the  incoming  message  into  a  segment,  it 
inserts  it  onto  the  end  of  the  appropriate  prionty  message/task  queue.  The  queues 
are  maintained  by  the  task  queue  registers  in  the  register  architecture. 

5.2.7  Handlers 

Code  for  PiMac  is  stored  in  read-write  segments.  A  code  segment  implements  a 
specific  message  handler.  Handlers  are  identified  by  their  segment  ID.  There  are 
two  types  of  handlers:  resident  and  non-resident.  Resident  handlers  are  created 
(permanently)  on  every  node  at  system  initialization  time.  The  machine  dependent 
compiler  allocates  an  identical  segment  ID  for  each  resident  handler  on  all  nodes. 
Non-resident  handlers  are  not  present  on  a  node  unless  they  are  specifically  copied 
there.  They  are  not  guaranteed  to  have  identical  segment  IDs  on  all  nodes.  However, 
both  handler  types  are  invoked  identically. 
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Table  5.8:  Instruction/ Addressing  Mode  Map 
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0 

code  segment 

1 

instruction  pointer 

2 

register  A 

3 

register  B 

4 

register  C 

5 

register  D 

6 

register  E 

1 

register  F 

8 

message  length 

9 

message  priority 

10 

message  type 

11 

message  locals 

Table  5.9:  Task  Segment  Format 
5.2.8  Tasks  and  Messages 

.4  task  is  created  when  a  message  is  received  on  a  node.  The  message  contents  are 
stored  on  tne  node  in  a  task  segment.  A  task  segment  is  a  read-write  segment  which 
obeys  special  conventions  for  storage  in  the  bottom  twelve  words.  Table  5.9  shows 
the  format  of  a  task  segment. 

The  first  four  words  of  the  message  are  stored  in  the  task  segment  beginning  at  offset 
eight.  The  network  controller  reads  the  locals  (number  of  locals)  field  to  determine 
the  offset  for  the  first  argument.  Argument  zero  is  stored  at  offset  locals  +  12.  The 
network  controller  also  stores  the  message  type  field  in  the  segment  ID  field  of  the 
code  segment  (offset  0).  The  type  field  contains  the  segment  ID  of  the  appropriate 
message  handler.  The  instruction  pointer  is  also  zeroed  to  begin  the  execution  with 
the  first  word  of  the  code  segment. 

Only  255  segments  on  each  node  can  be  the  active  segment  for  a  task.  Since  segment 
names  are  used  as  task  names,  this  limits  the  number  of  tasks  that  can  exist  on  a 
node  simultaneously.  This  design  limit  is  based  on  the  expectation  that  most  tasks 
are  very  short  lived.  In  the  examples  in  Chapter  3,  over  99  percent  of  executed  tasks 
are  for  communication  only  (data  delivery).  They  execute  few  instructions  and  are 
non-blocking.  This  ephemeral  nature  means  the  probability  of  deadlock  on  a  node, 
caused  by  many  suspended  task  and  no  free  task  IDs,  is  unlikely.  If  a  node  does  run 
out  of  segment  IDs,  the  condition  is  handled  by  a  cat«istrophe  routine  that  handles 
errors  such  as  out  of  memory,  by  zero,  etc.  Typically,  this  handler  halts  activity  on 
the  node  and  issues  error  messages  to  the  host. 

Each  task  segment  includes  a  link  which  is  used  to  create  chains  for  task  queues  and 
lists.  The  task  ID  zero  is  reserved  as  an  end-of-chain  marker. 
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5.2.9  Microsequences 


There  are  two  task  management  microsequences  in  PiMac:  Suspend  and  Dequeue. 
They  are  supported  as  microsequences  rather  than  trap  handlers  because  of  their 
critical  impact  on  task  overhead.  When  a  microsequence  is  invoked,  normal  program 
execution  is  suspended  and  a  sequence  of  predefined  microoperations  is  performed. 
Microsequences  only  occur  at  the  completion  or  abortion  of  a  PiMac  instruction. 
Microsequences  cannot  cause  traps  or  invoke  other  microsequences. 


Suspend 

A  task  can  suspend  for  three  reasons:  (a)  the  task  accesses  a  locked  segment  location 
(i.e.,  a  synchronization  suspend),  (b)  the  task  executes  the  suspend  instruction,  or  (c) 
a  trap  event  occurs.  In  all  cases,  the  following  microsequence  occurs: 

1.  Registers  C,  D.  E.  and  F  are  written  as  a  quadword  to  the  active  segment 
beginning  at  offset  four. 

2.  The  code  segment,  instruction  pointer,  register  A,  and  register  B  are  written  as 
a  quadword  to  the  active  segment  beginning  at  offset  zero. 

3.  If  the  active  buffer  dirty  bit  is  set,  it  is  written  back  to  its  location  in  the  active 
segment. 

4.  If  the  other  buffer  dirty  bit  is  set,  it  is  written  back  to  the  appropriate  segment. 

For  synchronization  suspends,  several  additional  operations  are  performed  concur¬ 
rently  with  these  steps.  During  step  1,  the  task  ID  of  the  current  ta.sk  is  exchanged 
with  the  suspended  task  field  of  the  accessed  location.  The  instruction  pointer  is  also 
decremented  during  this  step.  During  step  2,  the  previous  suspended  ta.'^k  field  is 
stored  as  the  link  of  the  task  segment. 

For  suspend  operations,  two  additional  steps  occur: 

5.  The  tail  of  the  current  priority  queue  is  used  to  index  a  task  in  the  segment 
map. 

6.  The  first  quadword  of  the  indexed  task  in  loaded  into  the  other  buffer  and  its 
task  link  is  set  to  the  recently  suspended  task  ID.  The  tail  of  the  current  priority 
queue  is  set  to  the  suspended  task  ID  as  well. 

These  steps  enqueue  the  task  at  the  end  of  the  appropriate  suspended  task  queue. 

For  a  trap  event,  the  suspended  task  ID  is  stored  in  the  trap  task  ID  register. 

Consider  an  example  of  this  microsequencc.  Figure  5.6  illustrates  a  case  ..here  a  task 
(ID  5)  executes  a  blocked  access  on  the  second  word  (d2)  of  segment  (ID  7).  The 


102 


Figure  5.6:  Suspend  Microsequence  Example 
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slot  d2  already  ha5  one  suspended  task  (ID  3)  in  its  suspended  task  field.  In  step  1. 
registers  C  through  F  are  stored  in  the  appropriate  place  in  the  active  task  segment. 
The  suspended  task  field  of  the  accessed  value  is  exchanged  with  the  current  task  ID. 
Here,  task  ID  3  is  exchanged  with  task  ID  5.  Also  in  step  1,  the  instruction  pointer 
is  decremented. 

In  step  2.  register  A  and  B.  the  code  segment,  and  the  instruction  pointer  are  written 
to  the  active  segment.  During  this  write,  the  link  of  the  task  segment  (ID  .5)  is  set  to 
the  previous  head  of  the  suspended  task  chain  (ID  3)  of  word  d2. 

In  step  3.  the  active  buffer  is  written  back  to  the  appropriate  place  in  the  active 
segment  if  its  dirty  bit  is  set. 

Step  4  writes  the  other  buffer  back  to  segment  7.  This  also  updates  the  new  suspended 
task  chain  head  for  value  d2.  After  the  microsequence  completes,  the  state  of  task 
•T  has  been  preserved  in  the  task  segment,  and  the  task  has  been  pushed  onto  the 
suspended  task  chain  of  the  value  d2  of  segment  7. 


Dequeue 

This  microsequence  occurs  whenever  a  task  completes  or  is  suspended.  It  also  executes 
at  the  completion  of  a  trap  handler.  It  is  responsible  for  selecting,  dequeuing,  and 
starting  or  restarting  a  task.  The  following  steps  occur: 

1.  The  highest  priority,  non-empty  queue  is  selected.  The  head  pointer  is  used  to 
index  a  tcisk  in  the  segment  map. 

2.  The  first  quadword  of  the  index  task  is  loaded  into  register  A  and  B.  the  code 
segment,  and  the  instruction  pointer.  The  queue  head  pointer  is  set  to  the  value 
task  link.  The  active  segment  task  link  is  also  set  to  empty. 

3.  The  second  quadword  is  loaded  into  registers  C,  D,  E,  and  F. 

4.  If  the  code  segment  base  and  bound  fields  are  empty,  the  segment  is  loaded 
from  the  segment  map. 

Execution  is  then  turned  over  to  the  task  at  the  current  instruction  pointer.  For  a 
trap  handler  completion,  step  1  obtains  the  task  ID  from  the  trap  task  register  rather 
than  a  task  queue. 

An  example  dequeue  sequence  is  shown  in  Figure  5.7. 

In  step  I,  the  next  task  to  be  executed  is  identified  from  the  task  queue.  The  highest 
priority  queue,  zero,  is  empty  (indicated  by  e).  The  next  highest  priority  queue 
contains  three  t^lsks:  five,  three,  and  seven.  The  first  task,  five,  is  selected  as  the 
current  active  task.  It’s  base  and  size  are  loaded  from  the  segment  map.  In  step  2, 
the  code  segment,  instruction  pointer,  and  register  A  are  restored.  The  task  queue 
head  is  set  to  the  next  task  in  line  (ID  3).  Also,  the  tcisk  link  of  task  5  is  cleared.  In 
step  3.  the  remaining  processor  state  is  loaded.  Finally,  the  code  segment  base  and 
size  are  restored  from  the  segment  map  (assuming  they  were  empty)  in  step  4. 
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5.2.10  Traps 


The  PiMac  architecture  includes  a  trap  mechanism  to  handle  exceptions.  When  a  trap 
is  generated,  the  suspend  sequence  is  generated  and  the  suspended  task  ID  is  stored 
in  the  trap  task  register.  At  this  point,  traps  are  disabled,  preventing  unrecoverable 
double  trap  errors.  Then  the  trap  number  is  used  to  reference  the  trap  handler 
segment  ID  (which  is  loaded  at  system  initialization  time  with  the  resident  handlers). 
This  handler  is  invoked  using  a  trap  segment  that  is  reserved  for  use  as  the  active 
segment  for  traps.  When  the  trap  handler  completes,  the  dequeue  microsequence  is 
executed  using  the  trap  task  register  as  the  source  task.  Trap  handlers  contain  system 
code,  but  are  identical  in  format  to  other  handlers. 

Most  trap  handlers  perform  the  standard  responses  to  exceptions  (e.g.,  ALU  overflow, 
queue  empty,  etc.).  These  handlers  are  outside  the  scope  of  this  thesis.  The  following 
trap  handlers  perform  more  specific  functions  in  the  machine  architecture. 


Enqueue 

The  enqueue  trap  can  occur  (a)  if  a  task  performs  an  access  on  a  segment  location 
which  enables  the  suspended  tasks  to  be  requeued  for  execution,  (b)  a  task  executes 
the  suspend  instruction,  or  (c)  a  trap  occurs.  For  (b)  and  (c),  the  enqueue  trap  is 
executed  immediately  following  the  suspend  trap. 

This  trap  takes  a  suspended  task  chain,  and  inserts  it,  in  reverse  order,  on  the  ap¬ 
propriate  priority  task  queue.  The  chain  order  must  be  reversed  to  preserve  FIFO 
ordering  of  suspended  tasks.  The  trap  walks  down  the  suspended  task  chain,  revers¬ 
ing  the  chain  links.  Then  it  walks  back  up  the  chain  inserting  each  entry  on  the  tail 
of  a  task  queue.  The  task  queue  is  determined  by  accessing  the  message  priority  field 
from  the  taisk. 


Segment  Allocation 

This  trap  occurs  when  the  free  segment  list  becomes  empty,  or  when  an  abnormal 
sized  segment  is  required  by  the  network  controller.  The  trap  handler  allocates  one 
or  more  segments  and  adds  them  to  the  segment  free  list.  This  trap  handler  scans 
the  segment  map  for  free  segment  IDs. 

This  trap  does  not  violate  the  atomicity  requirement  of  Pi  since  the  trap  handier 
does  not  affect  any  state  related  to  the  executing  task.  It  is  an  implementation 
simplification. 


5.3  A  Gate  Array  Implementation 

In  this  section,  a  gate  array  implementation  of  the  PiMac  machine  architecture  is 
considered.  In  order  to  evaluate  the  feasibility  of  an  implementation,  estimated  gate 
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module 

device  count 

gate  count 

diagnostic  interface 

3,696 

616 

instruction  prefetcher 

3,168 

528 

controller 

8,373 

1,396 

address  datapath 

75,244 

12,541 

integer  datapath 

39,368 

6,561 

external  memory  interface 

8,990 

1,498 

internal  memory  interface 

13,373 

2,229 

network  interface  (in) 

4,450 

742 

network  interface  (out) 

18,214 

3,036 

router  (three  dimensions) 

30,540 

5,090 

total 

205.416 

34,236 

Table  5.10:  MDP  Device  and  Gate  Count  Estimates  (24  July  1990) 

counts  and  timing  of  a  gate  array  PiMac  are  examined.  This  section  draws  heavily 
on  the  design  of  the  Message  Driven  Processor  [14,  15).  The  specification  of  several 
modules  of  the  microarchitecture  is  taken  directly  from  corresponding  modules  in  the 
MDP.  Because  of  the  close  relationship  of  PiMac  to  the  MDP,  an  overview  of  the 
MDP  is  provided. 

5.3.1  The  Message  Driven  Processor 

The  MDP  is  being  currently  being  designed  in  the  Concurrent  VLSI  Architecture 
Group  at  MIT.  Work  began  on  the  MDP  in  1986.  Nearly  all  of  the  design  has  been 
completed  by  graduate  students.  The  design  team  has  averaged  around  six  to  eight 
members  over  the  design  period. 

The  architectural  specification  was  completed  in  the  Sununer  of  1988.  A  register  level 
specification  was  completed  in  Spring  of  1989.  The  hardware  architecture  is  imple¬ 
mented  using  an  Intel  standard  cell  library.  The  implementation  will  be  fabricated 
in  a  one  micron  CMOS  process  on  a  10  x  15  mm  die.  The  expected  clock  rate  is  15 
MHz. 

Table  5.10  summarizes  device  and  gate  estimates  for  the  MDP®,  It  is  difficult  to 
compare  the  library  cells  used  for  the  MDP  with  the  gate  array  gates  used  for  PiMac. 
The  gate  counts  provided  in  Table  5.10  are  computed  using  an  average  device  per 
gate  ratio  of  six.  The  device  per  cell  averages  are  higher  (between  eight  and  twelve), 
but  many  of  these  larger  cells  require  multiple  gates  to  implement. 

The  implementation  of  PiMac  described  here  uses  a  less  aggressive  gate  array  tech¬ 
nology.  The  PiMac  machine  architecture  is  in  many  ways  simpler  than  the  MDP. 

®This  table  includes  logic  only;  MDP  also  includes  4096  words  of  on-chip  memory  ('890,000 
devices) 
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However,  the  MDP  still  provides  the  most  accurate  information  for  estimating  details 
about  this  PiMac  implementation. 


5.3.2  PiMac  Module  Designs 

This  section  examines  each  module  of  PiMac’s  microarchitecture  with  respect  to  the 
corresponding  MDP  module  or  modules.  Gate  estimates  used  in  this  section  are 
derived  from  LSI  Logic  macrocells  [37,  40]. 


Router 

The  PiMac  router  design  is  taken  directly  from  the  MDP.  It  is  comprised  of  three 
independent  dimensions.  Each  dimension  of  the  router  buffers  a  flit  of  the  message 
passing  through  it,  and  compares  the  count  of  the  destination  field  to  the  node's 
address  (in  the  appropriate  dimension).  If  the  dimension  matches  the  node’s  address, 
the  message  is  passed  to  the  next  dimension  for  routing,  or  to  the  node  if  the  final 
dimension  is  routed.  Otherwise,  the  message  is  routed  towards  the  destination  in  the 
current  dimension. 

The  router  module  requires  roughly  5000  gates.  It  can  transmit  a  flit  in  one  clock, 
with  a  half  flit  transmitted  on  each  clock  edge. 


Network  Controller 

PiMac's  network  controller  differs  from  MDP’s  network  interface  in  several  ways. 
First,  PiMac  has  no  outgoing  message  buffer.  The  send  operation  takes  a  word  from 
the  datapath  and  inserts  it  directly  into  the  network  router  via  the  network  controller. 
The  next  instruction  cannot  begin  until  the  sent  word  has  started  entering  the  network 
router.  Because  of  this  difference,  the  gate  estimate  for  MDP’s  outgoing  network 
interface  is  higher  than  the  PiMac  counterpart. 

Second.  PiMac  network  controller  performs  a  more  complicated  procedure  for  receiv¬ 
ing  a  message  from  the  router,  since  it  must  determine  the  required  storage  size, 
allocate  the  storage  from  the  free  segment  list,  or  request  a  larger  segment  from  the 
processor.  While  no  additional  buffering  is  required,  the  control  mechanism  is  much 
more  complicated.  This  gate  complexity  is  comparable  to  that  of  the  MDP’s  outgoing 
network  interface. 

PiMac’s  network  controller  requires  roughly  5000  gates.  For  an  outgoing  message,  it 
has  a  delay  comparable  to  the  datapath.  For  incoming  messages,  the  network  interface 
requires  one  clock  to  compare  the  length  field  to  the  fixed  message  length,  one  clock 
to  allocate  the  segment  from  the  free  segment  list,  or  request  a  larger  segment  from 
the  processor,  and  one  clock  to  store  the  first  word  into  memory.  After  the  first  word 
is  stored,  additional  words  are  stored  every  two  clocks  (the  reception  rate  from  the 
router).  The  message  is  stored  into  memory  via  the  input  buffer. 
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(one  word  of  a  buffer) 


Figure  5.8:  Buffer  State  Machine 


Datapath 

PiMac's  datapath  is  similar  to  the  MDP’s.  It  provides  fewer  registers  (463  register 
bits  for  PiMac'  versus  1660  register  bits  for  MDP).  It  includes  logic  from  MDP's 
address  datapath,  integer  datapath,  and  internal  memory  interface. 

The  approximately  6500  gates  required  by  the  MDP  integer  datapath  is  comparable  to 
the  corresponding  datapath  in  PiMac.  PiMac  has  fewer  data  registers  (six  for  PiMac 
versus  twelve  for  MDP).  However.  PiMac's  registers  have  an  extra  port  to  allow  them 
to  be  accessed  as  a  quadword  with  memory.  The  datapaths  contain  similar  logic  in 
the  ALU. 

The  address  datapath  requires  a  more  detailed  comparison.  The  MDP’s  address  dat¬ 
apath  module  was  designed  to  support  a  wide  range  of  architectural  features  besides 
address  arithmetic.  Many  of  these  other  features  are  not  provided  in  PiMac.  For 
example,  PiMac  does  not  support  circular  queues,  ID  registers,  status  bits,  dedicated 
address  registers,  multiple  execution  priority  levels,  multiple  fault  handling,  or  special 
translation  addressing.  Most  of  the  features  unique  to  PiMac  are  supported  in  the 
segment  buffer  module. 

Using  a  breakdown  of  the  MDP’s  address  datapath,  an  estimated  3500  gates  are 
required  to  support  address  and  instruction  pointer  arithmetic.  This  brings  the  total 
gale  count  estimate  for  PiMac’s  datapath  to  10000. 

Segment  Buffers 

The  four  segment  buffers  provide  memory  caching  for  the  datapath,  instruction  unit, 
and  network  controller  (for  incoming  messages).  The  active  segment  and  other  seg¬ 
ment  buffers  also  include  simple  state  machines  on  each  word  to  support  access  be¬ 
haviors  for  attributes.  One  of  these  state  machines  is  shown  in  Figure  5.8. 

The  active  segment  and  other  segment  buffers  also  include  comparators  to  support 
the  match  instruction.  These  comparators  are  shown  in  Figure  5.9. 

"PiMac  contains  an  additional  828  register  bits  in  the  segment  buffers  but  they  are  not  counted 
here. 
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word  to  match  matched  word 

from  datapath  to  datapath 


Figure  5.9;  Buffer  Comparators 


buffer  type 

total  bits 

gates/bit 

total  gates 

active  and  other 

384 

7 

2,688 

input 

192 

6 

1,152 

instruction 

192 

4 

768 

key  and  dirty  bits 

60 

4 

240 

total 

828 

4,848 

Table  5.11;  PiMac  Buffer  Gates  (not  including  comparators  and  selection  logic) 

There  are  four  buffers  in  this  module.  Each  buffer  consists  of  four  46  bit  words,  a 
dirty  bit,  and  a  14  bit  tag.  Two  of  the  four  buffers  are  used  in  conjunction  with  the 
datapath  (active  and  other).  Each  bit  of  these  buffers  is  gated  on  four  busses  (three 
for  datapath,  one  for  memory  interface).  A  gate  array  implementation  of  a  buffer  bit 
requires  seven  gates.  The  input  buffer  contains  an  internal  shift  register.  Both  the 
input  and  instruction  registers  have  a  single  input  and  output  port.  Input  buffer  bits 
require  six  gates,  and  an  instruction  buffer  bits  require  four  gates.  The  total  estimate 
for  buffer  storage  is  summarized  in  Table  5.11. 

An  additional  1000  gates  are  required  for  the  comparators  and  500  gates  for  selection 
logic.  Therefore,  the  total  gate  estimate  for  the  segment  buffers  module  is  6500  gates. 
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module 

gate  count 

network  router 

5,000 

network  controller 

5,000 

datapath 

10,000 

segment  buffers 

6,500 

memory  controller 

1,500 

instruction  unit 

500 

controller 

1,500 

total  gates 

30,000 

Table  5.12;  PiMac  Module  Summary 


Memory  Controller 

PiMac’s  memory  controller  module  is  taken  directly  from  MDP’s  external  memory 
interface.  It  should  therefore  require  roughly  1500  gates  to  implement.  The  external 
memory  interface  accesses  a  quad  word  of  external  memory  as  16  twelve  bit  chunks. 
The  entire  access  takes  approximately  720  ns.  An  on-chip  memory  cache  is  considered 
in  Section  5.3.4. 


Instruction  Unit 

The  instruction  unit  fetches  and  decodes  instructions  in  parallel  with  instruction 
execution  in  the  datapath.  Instructions  are  fetched,  four  at  a  time,  into  the  instruction 
buffer.  Simple  logic  predecodes  the  address  mode  and  instruction  type  and  presents 
it  to  the  controller.  Because  of  the  simpler  (but  less  compact)  instruction  format  of 
PiMac,  a  slightly  smaller  instruction  unit  is  expected.  The  gate  estimate  is  500  gates. 


Controller 

PiMac’s  controller  is  comparable  to  the  MDP’s  controller.  Both  architectures  have  a 
similar  number  of  features  that  contribute  to  the  controller’s  complexity.  For  example, 
PiMac  has  simpler  fault  handling  than  the  MDP,  but  it  includes  two  “hardwired” 
microsequences.  The  estimate  for  PiMac’s  controller  is  1500  gates. 

PiMac  module  gate  count  estimates  are  summarized  in  Table  5.12. 


5.3.3  PiMac  Timing 

Like  the  MDP,  it  is  expected  that  PiMac  is  designed  with  a  maximum  of  eight  gate 
delays  per  clock  phase  (16  total).  Most  PiMac  instruction  types  can  execute  in  a 
single  clock  cycle.  However,  several  conditions  affect  the  instruction  rate. 
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Segment  Buffer  Accesses 


When  an  instruction  accesses  a  word  in  a  segment,  the  segment  address  is  compared 
with  the  tags  of  the  active  and  other  segment  buffers.  If  there  is  a  match,  the  correct 
word  is  selected  and  made  available  to  the  datapath  in  the  next  cycle.  This  means 
that  instructions  that  use  addressing  modes  0,  1,4,  and  5  execute  in  a  minimum  of 
two  clocks. 

If  the  accessed  word  is  not  in  the  segment  buffer,  it  must  be  loaded  from  memory.  This 
process  involves  freeing  either  the  active  or  other  segment  buffer.  If  the  appropriate 
buffer’s  dirty  bit  is  set,  the  buffer  must  first  be  written  back  to  memory.  Then  the 
accessed  word  can  be  loaded.  The  extra  cycles®  required  for  these  memory  accesses 
are  added  to  the  instruction  time. 


Instruction  Buffer  Accesses 

The  instruction  buffer  holds  four  instructions.  When  the  instruction  unit  fetches  an 
instruction  found  in  the  buffer,  it  is  accessed  in  a  single  cycle.  When  an  instruction 
is  not  present  in  the  buffer,  the  appropriate  quadword  of  the  code  must  be  fetched 
from  memory.  This  also  adds  extra  cycles  to  an  instruction  time. 


Microsequence  Invocations 

Microsequences  can  be  invoked  either  directly  (via  the  resume,  call,  suspend,  or  end 
instructions),  or  indirectly  by  a  segment  access.  In  either  case,  the  microsequence 
e.xecution  time  is  added  to  the  instruction  time. 

5.3.4  A  PiMac  Gate  Array 

The  proposed  PiMac  gate  array  contains  approximately  30,000  gates  and  executes  at 
a  cycle  time  of  16  gate  delays.  This  could  be  realized  using  several  commercial  gate 
arrays.  For  example,  an  LSI  Logic  LCAIOOK  gate  array  (.7  micron  HCMOS)  could 
accommodate  this  design  with  a  clock  rate  of  up  to  50  MHz  (20  ns)  [38]. 

A  more  interesting  alternative  is  the  embedded  array  gate  arrays  which  combine  gate 
array  logic  with  a  high  density  static  memory  array  on  a  single  die.  For  example, 
the  LEAlOOK  series  would  provide  an  estimated  1000  words  of  memory  on-chip  with 
the  30.000  gate  PiMac  design  [39].  This  would  reduce  the  memory  access  delay  for  a 
quadword  (192  bits)  from  16  external  memory  cycles  (720  ns),  to  a  single  quadword 
on-chip  memory  access  (40  ns). 

A  higher  performance  implementation  of  the  PiMac  machine  architecture  could  be 
constructed.  This  implementation  has  selected  because  it  is  (a)  simple  to  design,  and 
(b)  it  is  not  aggressive  with  respect  to  hardware  techniques  and  technology.  It  is  the 

*The  number  of  extra  cycles  depends  of  the  memory  access  speed. 
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kind  of  implementation  that  can  be  undertaken  in  a  university  research  group.  If 
a  commercial  version  of  a  PiMac-like  machine  architecture  is  ever  produced,  it  will 
certainly  incorporate  more  advanced  techniques  and  technology  producing  a  much 
higher  performance  machine.  This  gate  array  implementation  defines  a  baseline  Pi 
substrate. 


5.4  Summary 

This  chapter  presents  PiMac,  a  Pi  machine  architecture.  Its  design  is  based  on  the 
requirements  of  Pi,  and  the  composite  model  mechanism  characteristics  from  Chapter 
4.  PiMac  has  also  been  strongly  influenced  by  the  design  of  the  Message  Driven 
Processor  at  MIT.  The  significant  features  of  PiMac  include: 

fast  task  suspends  and  dequeues  (four  memory  cycles  each) 

fast  task  allocation/deallocation  via  the  free  segments  list 

multiple  task  queues  for  priority  messages 

autonomous  network  router 

low  overhead  for  message  sending  and  receiving 

segment  buffers  for  low  memory  access  latency 

segment  addressing  modes  with  bounds  checking 

low  cost  synchronization  with  per  word  granularity 

support  for  associative  segments 

This  chapter  also  described  a  proposed  implementation  of  PiMac  on  a  single  gate 
array  with  1000  words  of  on-chip  memory  and  a  clock  rate  in  excess  of  25  MHz. 

While  PiMac  is  still  only  a  proposed  system®,  it  ratifies  the  goal  of  this  chapter:  to 
demonstrates  the  feasibility  of  a  Pi  substrate.  The  effectiveness  of  this  substrate  in 
supporting  Pi  is  the  subject  of  the  next  chapter. 


*The  actual  realization  of  PiMac  will  be  pursued  in  future  research. 
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Chapter  6 
Pi  on  PiMac 


This  chapter  examines  the  support  of  the  Pi  interface  on  PiMac.  The  match  of  the  Pi 
abstract  machine  to  the  PiMac  machine  architecture  is  considered.  Some  functions  of 
the  machine  dependent  compiler  and  the  runtime  operating  system  are  also  discussed. 
To  avoid  confusion,  the  term  instruction  is  used  to  denote  PiMac  machine  instructions; 
the  term  operation  is  used  to  denote  Pi  abstract  machine  operations.  The  compiler 
and  runtime  systems  described  in  this  chapter  were  influenced  by  [27,  28,  29,  57]. 


6.1  Storage 

In  both  the  Pi  abstract  machine  and  the  PiMac  machine  architecture,  storage  is  pre¬ 
sented  as  part  of  a  complete  node.  This  provides  locality  between  the  processing  and 
storage  elements  of  the  system.  Read-Write  and  associative  segments  are  supported 
using  the  same  local  memory  in  the  PiMac  machine  architecture. 


6.1.1  Read- Write  Storage 

PiMac’s  instruction  set  architecture  is  designed  to  provide  efficient  support  for  read- 
write  segment  accesses.  Segment  off-set  addressing  with  bounds  checking  is  an  address¬ 
ing  mode  for  most  instruction  types.  Segment  buffers  cache  named  segment  locations, 
reducing  the  latency  of  these  accesses.  Register  addressing  modes  are  also  provided 
to  reduce  the  load  on  the  memory  system  and  simplify  the  instruction  formats. 

A  Pi  read  operation  is  supported  by  addressing  modes  0  and  1.  One  read  and  a 
dyadic  operation  (e.g.,  add)  is  supported  by  a  single  PiMac  instruction.  If  the  other 
operand  for  the  dyadic  operation  is  not  present  in  a  register,  it  is  placed  in  one,  using 
a  pseudo-load  instruction  (e.g.,  (or  operand  0)). 

The  machine  cycle  cost  of  performing  a  dyadic  Pi  operation  depends  on  where  the 
operands  reside.  Table  6.1  summarizes  the  cost  for  each  case.  Since  some  operations 
are  not  commutative,  operand  order  can  affect  the  cost.  This  table  assumes  that  (a) 
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operand  A 

operand  B 

cost 

mode 

register 

register 

mode  2 

register 

constant 

1  cycle 

mode  3 

constant 

register 

2  cycles 

mode  2* 

register 

segment 

2  cycles 

mode  0 

segment 

register 

3  cycles 

constant 

3  cycles 

segment 

constant 

3  cycles 

segment 

segment 

Table  6.1:  Dyadic  Operation  Costs 

segment  locations  are  present  in  the  segment  buffers,  and  (b)  the  operation  destination 
is  a  register.  If  the  operands  are  in  segment  locations  that  are  not  present,  the  cost 
of  loading  them  must  be  added  to  the  total.  If  the  result  must  be  written  to  a 
segment,  an  additional  cycle  is  required.  Additional  time  is  also  required  if  the  access 
of  an  operand  results  in  an  enqueue  trap  (the  time  is  dependent  on  the  length  of  the 
suspended  tasks  list). 

A  Pi  write  operation  is  supported  directly  by  the  store  instruction  (addressing  modes 
4  or  5).  This  instruction  requires  two  machine  cycles.  Sometimes  the  machine  depen¬ 
dent  compiler  can  eliminate  segment  reads  and  writes  by  retaining  results  in  registers. 
This  is  possible  if  (a)  the  segment  location  is  not  attributed,  and  (b)  other  tasks  do 
not  read  the  value  in  the  segment.  By  retaining  local  and  intermediate  values  in 
registers,  segment  lengths  can  also  be  reduced. 

Read- Write  segment  allocation  is  supported  by  the  runtime  portion  of  the  operating 
system.  Since  storage  and  segment  ID  allocation  are  a  critical  part  of  task  overhead. 
PiMac  supports  special  mechanisms  and  techniques  for  efficient  read-write  segment 
allocation. 

Most  read- write  segments  used  in  the  examples  in  Chapter  3  are  small  and  ephemeral. 
Therefore  the  runtime  system  uses  a  fixed  segment  size  for  nearly  all  segment  requests. 
In  Chapter  4,  it  was  shown  that  95.8%  of  segments  in  the  example  are  less  than  eight 
words  long.  99.0%  are  less  than  16  words  long.  Since  PiMac  requires  an  additional 
eight  words  for  processor  state,  a  fixed  size  of  24  words  would  satisfy  most  segment 
needs. 

Free  segments  of  this  fixed  size  are  maintained  by  the  free  segment  list  on  the  PiMac 
register  architecture.  This  supports  very  fast  segment  allocation  for  the  most  common 
type  of  requested  segment  (segments  less  than  24  words).  The  network  controller  can 
allocate  a  segment  from  this  list  in  two  machine  cycles.  A  Pi  create-read-write-segment 
operation  allocates  a  fixed  sized  segment  with  the  following  sequence: 

*The  constant  operand  is  first  preloaded  into  a  register. 

^The  segment  operand  is  first  preloaded  into  a  register. 
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1 .  Compare  the  requested  segment  size  to  the  fixed  size. 

2.  If  it  is  too  long,  branch  to  a  more  complicated  allocation  routine. 

3.  Read  the  head  of  the  free  segments  list. 

4.  Read  the  segment  link  (the  suspended  tasks  field  of  word  zero). 

5.  Write  the  new  head  of  the  free  segment  list. 

This  sequence  takes  six  machine  cycles.  Read- Write  segments  that  are  larger  than 
the  fixed  size  are  handled  by  a  slower,  heap  storage  management  mechanism.  The 
network  controller  generates  a  segment  allocation  trap  if  a  large  segment  is  required. 

Since  segment  IDs  are  retained  in  the  free  segment  list,  ID  allocation  is  provided 
automatically,  except  for  the  large  segment  case,  where  the  segment  map  is  scanned 
for  an  unused  ID. 

The  machine  dependent  compiler  predefines  certain  segment  descriptors  (including 
the  segment  IDs)  for  resident  handler  code  segments,  the  nodal  segir  nt.  trap  han¬ 
dler  code  and  local  variable  segments.  It  also  defines  nodal  variable  fsets.  These 
descriptors  and  offsets  are  compiled  in-line,  avoiding  more  expensive  runtime  lookups. 
This  allows  these  objects  to  be  accessed  quickly  (in  two  cycles  if  the  appropriate  value 
is  in  the  segment  buffer). 

6.1.2  Associative  Segments 

Associative  segments  are  required  much  less  frequently  than  read-write  segments. 
When  created,  they  have  a  much  longer  lifetime.  Because  of  this,  less  emphasis  is 
placed  on  fast  allocation.  A  Pi  Create-Associative-Segment  is  compiled  into  a  sequence 
of  instructions  that  allocates  a  read-write  segment  of  the  correct  size,  and  initializes 
it  as  an  associative  segment.  For  an  eight  word  segment,  this  takes  approximately  45 
machine  cycles. 

An  associative  segment  of  size  N  contains  2A^-l-4  slots.  The  first  N  slots  contain  keys; 
the  second  N  slots  contain  values.  The  extra  four  slots  are  used  for  segment  type 
information.  When  a  key  is  looked  up  in  the  segment  (using  the  match  operation), 
the  hash  operation  is  computed  on  the  key  and  segment  entry  size.  The  computed 
haish  value  is  used  to  index  a  quadword  in  the  associative  segment.  Then  a  segment 
buffer  is  used  as  a  four-way  set-associative  cache  to  attempt  to  match  the  entry.  If 
a  match  is  found,  the  value  is  accessed  from  the  value  slots  in  the  segment  using  the 
sum  of  the  hash  value  and  the  match  result  as  an  offset.  This  procedure  takes  five 
machine  cycles. 

The  size  and  safety  parameters  play  a  key  role  in  supporting  the  insert  operation. 
For  SAFE  segments,  the  insert  operation  must  first  test  to  see  if  the  slot  in  the 
segment  is  currently  holding  another  entry.  For  a  collision,  the  boundedness  of  a 
segment  determines  the  behavior.  For  UNBOUNDED  segments,  the  insert  operation 
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must  resize  the  segment  by  allocating  another  read-write  segment  and  rehashing  the 
entries.  For  BOUNDED  segments,  the  insertion  is  rejected.  For  UNSAFE  segments, 
collisions  can  result  in  the  overwriting  of  an  entry,  or  a  resizing  of  the  segment, 
depending  on  the  segment  boundedness. 

The  remove  operation  employs  the  hash  and  match  instruction  to  locate  and  remove 
the  specified  entry  (six  cycles).  The  clear  operation  reinitializes  the  associative  seg¬ 
ment.  For  an  eight  element  associative  segment,  this  takes  approximately  40  cycles. 

It  is  unlikely  that  any  Pi  substrate  will  support  all  associative  segment  types  in  hard¬ 
ware.  The  most  straightforward  type  is  BOUNDED-UNSAFE,  since  this  is  supported 
by  a  conventional  set  associative  memory.  Other  associative  segment  types  can  be 
supported  either  by  software,  or  by  some  combination  of  software  and  hardware.  In 
PiMac,  the  hash  and  match  instructions  provide  a  balance  of  efficiency  and  resource 
cost  for  all  associative  segment  types. 


6.1.3  Storage  Reclamation 

Storage  can  be  explicitly  deallocated  using  the  destroy-segment  operation.  If  the 
segment  is  a  fixed-sized,  read-write  segment,  it  is  placed  on  the  free  segment  list. 
Otherwise  it  is  marked  as  free  for  later  reclamation  at  the  next  heap  compaction. 
Before  placing  a  segment  on  the  free  list,  each  word  must  be  reinitialized  to  an  empty 
d-sync.  If  eight  words  are  used,  destroy-segment  costs  approximately  20  machine 
cycles. 

PiMac  supports  heap  compaction  via  the  segment  map.  Since  only  segment  IDs 
are  passed  out  of  a  node,  segments  can  be  moved  freely  within  a  node  as  long  as 
the  segment  map  is  updated,  and  internal  segment  descriptors  are  adjusted.  After 
a  compaction,  the  base  and  size  fields  of  each  segment  descriptor  (identified  by  the 
type  field)  are  zeroed.  On  the  next  access,  the  updated  fields  are  restored  from  the 
segment  map. 


6.1.4  Nodals 

Nodals  are  stored  in  a  special  segment  on  each  node.  A  nodal  is  accessed  by  refer¬ 
encing  the  nodal  segment  descriptor  using  the  nodal  offset.  Both  the  nodal  segment 
descriptor  and  the  nodal  offset  are  predefined  by  the  machine  dependent  compiler. 


6.2  Synchronization 

The  synchronizations  defined  in  Pi  are  supported  as  attributed  locations  in  read-write 
segments.  These  attributes  are  supported  directly  in  PiMac  via  the  access  pattern, 
access  locks,  and  suspended  tasks  fields  of  a  segment  location’s  tag. 
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6.2.1  Data  Synchronization 


When  a  location  is  attributed  as  D-SYNC,  the  tag  of  the  location  is  accessed  using  the 
read-tag  instruction.  The  read  lock  bit  is  set,  the  d-sync  access  pattern  is  stored,  and 
the  suspended  tasks  list  is  cleared.  The  location  tag  is  updated  using  the  write-tag  in¬ 
struction.  The  buffer  access  state  machine  suspends  (via  the  suspend  microsequence) 
any  task  attempting  to  read  the  location  until  it  is  written.  When  written,  the  buffer 
access  state  machine  initiates  a  requeue  microsequence  of  the  suspended  task  list  after 
the  operation  has  completed. 


6.2.2  Barrier  Synchronization 

Barrier  synchronizations  arc  initialized  by  setting  the  access  pattern  to  b-sync,  clear¬ 
ing  the  access  locks,  and  zeroing  the  value  fields.  The  adjust-count  operation  uses 
the  read-count  instruction  to  access  the  value  field  (thus  avoiding  a  possible  suspen¬ 
sion).  The  count  is  adjusted  and  tested.  If  greater  than  zero,  the  read  lock  bit  is 
set.  Otherwise,  the  read-lock  bit  is  cleared.^  The  new  count  is  stored  in  the  value 
field.  The  test-count  operation  is  supported  by  reading  the  value  field  of  the  barrier 
synchronization. 


6.2.3  Producer-Consumer  Synchronization 

Producer-consumer  synchronizations  are  initialized  in  the  same  way  as  data  synchro¬ 
nizations.  except  the  access  pattern  is  set  to  s-sync.  The  buffer  access  state  machine 
controls  the  behavior  of  the  synchronization  as  defined  in  Pi. 


6.2.4  Attributes 

In  PiMac.  attributes  are  supported  as  fields  of  every  word  in  local  memory.  This  adds 
a  13  bit  overhead  to  every  word,  yet  it  provides  fast  and  selective  synchronizations.  An 
alternate  approach  is  to  provide  a  common  suspended  t£isk  list  for  an  entire  segment. 
This  reduces  the  attribute  overhead  by  eight  bits  per  location,  and  allows  longer 
task  IDs.  However,  this  approach  has  less  selectivity,  since  all  suspended  tasks  are 
requeued  together,  independent  of  which  synchronization  becomes  ready.  Also,  since 
the  suspended  task  list  is  not  located  with  the  synchronization,  it  must  be  loaded 
whenever  a  suspension  or  requeue  occurs.  Since  this  usually  involves  several  loads 
and  stores  of  a  buffer,  the  overhead  of  a  synchronization  is  increased. 

The  probe  operation  is  supported  by  the  read-tag  instruction.  The  time  operation  is 
supported  by  the  node  clock  registers. 

^The  read  lock  bit  is  only  written  if  its  value  changes. 
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6.3  Communication 


The  principal  communication  operation,  send,  is  supported  directly  by  the  send  in¬ 
struction.  This  instruction  injects  one  word  (32  bits  value,  three  bits  type)  into  the 
network.  It  is  important  that  the  processor  supply  data  as  fast  as  the  network  can 
accept  it  to  prevent  message  “bubbles”.  Since  most  messages  are  short,  the  compiler 
can  preload  message  data  in  registers  and  buffers.  Some  message  data  is  also  stored 
as  constants  in  the  code  sequence  (e.g.,  the  end-of-message  word).  To  send  an  eight 
word  message,  eight  send  instructions  are  executed.  If  half  of  the  message  is  constants 
or  registers,  and  half  is  in  segments,  the  eight  word  message  takes  twelve  machine 
cvcles  to  transmit. 


6.3.1  The  Network 

The  PiMac  network  conforms  to  the  properties  of  the  Pi  interface.  In  this  PiMac 
implementation,  message  order  between  two  nodes  is  preserved.  This  will  change  as 
adaptive  routing  techniques  are  incorporated  in  the  future. 

The  PiMac  network  exerts  immediate  backpressure  on  a  node  by  blocking  the  send 
instruction,  halting  the  processor.  The  active  task  is  not  suspended.  The  processor 
waits  until  the  network  accepts  the  word  being  sent. 

In  a  superior,  but  more  complicated  approach,  the  message  sending  task  is  suspended, 
and  communication-only  messages  (that  are  guaranteed  not  to  block  or  send  addi¬ 
tional  messages)  are  executed.  This  frees  local  storage  space  without  creating  ad¬ 
ditional  tasks,  but  there  is  a  greater  risk  of  creating  a  bubble  in  the  message  being 
sent. 

Both  techniques  use  the  network  for  concurrency  control.  The  network  acts  as  a  con¬ 
currency  moderator,  slowing  down  message  sends  (and  task  creations)  as  the  network 
becomes  more  heavily  loaded.  More  sophisticated  network  moderation  techniques  em¬ 
ploy  time  averaging,  and  network  router  communication  messages  to  control  system 
concurrency  more  effectively.  These  techniques  will  be  addressed  in  future  research. 


6.4  Task  Management 

The  PiMac  machine  architecture  is  designed  to  support  low  overhead  tasks.  Effi¬ 
cient  message  sends  and  faist  segment  allocation  contribute  to  this  goal.  The  section 
considers  other  architectural  features  that  reduce  task  overhead. 

6.4.1  Task  Storage 

Task  storage  is  obtained  using  the  segment  allocation  mechanisms  described  in  Section 
6.1.1.  The  operation  self  is  supported  by  the  Rs  register.  It  always  contains  the  active 
task  segment  descriptor. 
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6.4.2  Task  Dispatch 


Every  message  contains  a  type  field  that  specifies  the  segment  ID  of  the  resident 
handler  to  be  executed.  When  the  message  is  dispatched,  the  type  field  is  used  to 
invoke  the  correct  handler.  The  machine  dependent  compiler  defines  the  segments 
IDs  for  all  resident  handlers.  The  operating  system  uses  these  IDs  when  it  distributes 
the  resident  handlers  to  each  node.  This  technique  supports  task  dispatch  via  a  single 
reference  to  the  segment  map  (two  machine  cycles). 

6.4.3  Task  Atomicity 

The  definition  of  Pi  requires  that  an  executing  task  cannot  be  preempted  by  the 
arrival  of  another  task.  PiMac  conforms  with  the  intent  of  this  requirement  (i.e.,  task 
execution  is  atomic  except  for  synchronization  or  explicit  suspensions).  The  PiMac 
machine  architecture  does  allow  suspensions  of  the  active  task  (via  traps)  to  support 
several  runtime  system  services  (e.g.,  segment  allocation,  heap  compaction,  etc.).  But 
these  trap  handlers  do  not  interfere  with  the  atomic  execution  of  the  active  task. 

The  suspend  operation  is  supported  directly  by  the  suspend  instruction  in  six  machine 
cycles.  When  a  task  executes  a  destroy-segment  operation  on  (self),  the  active  segment 
is  deallocated,  and  the  end  instruction  relinquishes  control  of  the  processor. 

6.4.4  Task  Prioritization 

Each  task  includes  a  priority  word  that  is  used  in  scheduling.  Pi  provides  an  arbi¬ 
trary  number  of  priority  levels,  but  does  not  guarantee  strict  correctness  of  message 
scheduling.  PiMac  supports  four  priority  queues  with  registers  in  the  architecture. 
This  provides  low  overhead  queuing  and  dequeuing  of  tasks.  The  machine  dependent 
compiler  can  usually  map  an  arbitrary  number  of  priority  levels  in  a  Pi  program  into 
the  four  levels  provided  by  PiMac.  If  runtime  computed  priorities  are  employe  i,  only 
the  least  significant  two  bits  are  used.  Alternatively,  the  compiler  can  remove  runtime 
determined  priorities,  and  use  a  different  scheme  for  priority  selection. 


6.4.5  Calling  Tasks 

The  call  operation  is  supported  by  the  call  instruction.  Atomicity  is  preserved  by 
inserting  the  calling  task  onto  the  head  of  the  highest  priority  queue. 

The  return  operation  is  supported  by  a  special  nodal  variable  that  contains  the  return 
value.  The  use  of  a  single  return  value  does  not  prohibit  nested  called  handlers,  but 
the  machine  dependent  compiler  must  move  the  return  value  store  to  the  end  of  the 
called  ta^k. 
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6.4.6  Variable  Argument  Passing 


Variable  argument  passing  is  done  in  Pi  via  the  send-segment  and  call-segment  op¬ 
erations.  These  operations  are  supported  using  the  send  and  call  instructions.  The 
machine  dependent  compiler  can  often  eliminate  the  additional  segment  if  (a)  it  is 
generated  within  the  Pi  task  that  invokes  the  send-segment  or  call-segment  operation, 
and  (b)  adequate  storage  is  available  in  the  active  segment. 


6.5  Locality 

Locality  is  primarily  supported  by  the  machine  dependent  compiler  and  the  runtime 
system.  The  proximity  information  provided  in  the  another-node-id  operation  can 
be  used  either  at  compile  time,  runtime,  or  both  to  decide  where  a  task  should  be 
created. 

The  node-id  operation  is  supported  in  PiMac  by  the  node  ID  register.  It  is  not 
generally  possible  to  reschedule  a  task  specified  to  run  on  (node-id)  on  another  node, 
since  it  may  reference  another  segment  that  resides  on  the  same  node. 

If  the  topology  of  a  specific  Pi  substrate  is  known,  Pi  abstract  node  numbers  can 
be  used  to  best  exploit  locality.  If  a  Pi  program  that  explicitly  uses  these  abstract 
node  numbers  is  transported  to  another  substrate  (with  a  different  topology),  it  will 
execute  correctly,  although  not  with  the  same  locality-dependent  performance. 

This  is  true  because  all  possible  abstract  node  numbers  are  mapped  to  a  physical 
node  in  the  Pi  substrate.  Pi  programs  that  are  likely  to  be  executed  on  several  Pi 
substrates  (with  different  topologies  and  number  of  physical  nodes),  may  have  better 
overall  performance  if  generic  locality  specifications  are  used.  This  allows  the  compiler 
and  runtime  system  more  flexibility  in  scheduling  tasks. 


6.6  Sequential  Operations 

All  the  sequential  operations  defined  in  Pi,  except  divide,  are  supported  by  a  single 
PiMac  instruction.  The  divide  operation  is  supported  by  a  short  sequence  of  arithmetic 
instructions  (approximately  25  machine  cycles). 


6.7  Summary 

Table  6.2  summarizes  the  costs  of  Pi  operations  on  PiMac.  The  Tium6er  field  indicates 
the  dynamic  frequency  of  the  operation  in  the  examples  in  Chapter  3.  The  ops  and 
~  fields  indicate  the  number  of  PiMac  instructions  and  machine  cycles  necessary  to 
support  the  operation. 
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operation 

number 

ops 

operation 

number 

ops 

adjust-count 

237,634 

2 

4 

insert 

5,403 

5 

8 

and 

55,268 

1 

2 

1- shift 

0 

1 

2 

another-node-id 

6,386 

2 

2 

match 

50,572 

3 

5 

attribute 

39,344 

1 

701,767 

1 

2 

a-shift 

29,078 

1 

2 

mod 

35,449 

20 

25 

branch-minus 

11,160 

1 

2 

move 

565,131 

1 

2 

branch-not-minus 

2,273 

1 

2 

not 

0 

1 

2 

branch-not-plus 

0 

1 

2 

or 

8,730 

1 

2 

branch-not-zero 

1,129,059 

1 

2 

plus 

902,473 

1 

2 

branch-plus 

1 

1 

2 

probe 

740 

2 

3 

branch-zero 

76,640 

1 

2 

remove 

1,625 

3 

6 

call 

1 

6 

return 

1 

2 

call-segment 

256 

1 

6 

send 

488,760 

8 

12 

clear 

0 

30 

40 

send-segment 

64 

8 

12 

compare 

667,282 

1 

2 

suspend 

1 

6 

create-associative 

429 

35 

45 

test 

150 

1 

2 

create-read-write 

1,541 

5 

6 

test-count 

48,054 

1 

2 

destroy-segment 

533,956 

11 

20 

time 

0 

1 

1 

distance 

2,428 

8 

8 

1,066,416 

1 

2 

divide 

10,000 

20 

25 

xor 

0 

1 

2 

initialize 

65,810 

1 

2 

average 

6,843,848 

2.5 

w 

Table  6.2:  Pi  Operation  Cost  Summary 


122 


The  Pi  operations  self,  read,  and  write  are  excluded  from  this  table  since  their  cost  is 
included  in  other  Pi  operations.  The  pseudo-operations  initialize,  move,  and  test,  are 
provided  for  cases  where  reads  and  writes  occur  in  the  absence  of  other  operations. 

Dyadic  operations  assume  the  case  where  the  A  operand  is  in  a  register  and  the  B 
operand  is  in  a  segment  (two  cycles).  This  is  the  anticipated  average. 

The  average  presented  in  the  table  is  weighted  by  the  frequency  of  occurrence  in  the 
examples  (dynamic  average).  The  static  average  for  all  Pi  operations  is  4.6  PiMac 
instructions  and  6.9  PiMac  machine  cycles. 

Note  that  many  Pi  operations  are  supported  directly  by  a  PiMac  instruction"*.  It 
was  originally  intended  that  each  logical  instruction  (from  Chapter  4)  would  be  rep¬ 
resented  as  a  single  PiMac  machine  instruction.  However,  some  logical  instructions, 
such  as  create-associative-segment  and  send,  perturb  the  ratio  to  approximately  2.5 
PiMac  instructions  per  logical  instruction. 

The  figures  in  this  table  are  estimates.  The  exact  costs  will  be  determined  by  a 
machine  dependent  compiler  running  a  typical  mixture  of  applications. 

Overall,  PiMac  provides  reasonable  support  for  the  interface.  With  a  40  ns  machine 
cycle  time,  PiMac  executes  approximately  six  million  Pi  operations  per  second  (5.7 
MPOPS). 

The  greatest  unresolved  issue  is  the  hit  rate  of  the  segment  buffers.  In  the  gate  array 
implementation  of  PiMac,  going  off-chip  requires  16  memory  cycles  (720  ns).  If  the 
hit  rate  is  not  high,  much  of  the  performance  gains  of  PiMac  will  be  lost.  It  is  likely 
that  the  1000  words  of  on-chip  (low  latency)  memory  will  be  employed  as  a  segment 
cache.  Future  research  will  provide  the  answers. 


■’This  is  not  a  general  requirement  of  Pi  substrates.  The  emphasis  is  on  efficient  support  of 
parallel  operations,  not  “one  instruction  per  Pi  operation” . 
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Chapter  7 
Conclusion 


This  thesis  addresses  the  problem  of  monolithic  parallel  architectures.  It  defines  Pi, 
a  parallel  architecture  interface,  to  separate  model  and  machine  issues.  The  interface 
is  based  on  several  generic  parallel  model  requirements:  low  latency  communication, 
fast  task  switching,  low  cost  synchronization,  efficient  storage  management,  the  ability 
to  exploit  all  types  of  locality,  and  the  ability  to  efficiently  support  sequential  code 
segments  in  a  parallel  environment.  This  foundation  is  the  central  contribution  of 
this  interface  design. 

The  interface  provides  many  features  for  model  mechanism  construction  including: 

•  Linear-addressed  segments  (e.g.,  read,  write) 

•  Associatively  addressed  segments  (e.g.,  insert,  match) 

•  Common  synchronization  primitives  (data,  barrier,  producer-consumer) 

•  Communication,  both  for  data  transport  and  remote  task  invocation 

•  Task  management,  including  prioritized  scheduling,  local  or  remote  task  execu¬ 
tion.  voluntary  suspension,  and  variable  argument  passing 

•  A  common  namespace  for  storage,  synchronizations,  and  tasks 

•  Topology-independent  specification  of  locality  (e.g.,  another-node-id,  distance) 

•  Efficient  sequential  code  sequences 

The  interface  is  used  to  construct  several  key  model  mechanisms  including:  shared 
memory  (with  caches),  set  synchronization,  object  name  translation,  and  non-resident 
handler  support.  In  addition,  two  applications  which  incorporate  multiple  model 
mechanisms  are  built:  n-body  and  relaxation.  The  examples  are  executed  using 
PiSim,  a  Pi  simulation  and  metering  environment.  Their  behavior  and  performance, 
in  terms  of  basic  parallel  operations,  is  determined.  Using  composite  statistics  from 
all  of  the  executions,  a  first-order  profile  of  operation  frequency  is  derived. 
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The  information  from  these  examples  is  used  in  the  specification  of  PiMac,  a  Pi  ma¬ 
chine  architecture.  This  design  is  also  driven  by  the  generic  parallel  model  require¬ 
ments  identified  earlier.  The  specific  design  goals  for  PiMac  include:  low  latency 
communication,  fast  task  switching  (suspensions,  queuing,  and  resumptions),  multi¬ 
ple  task/message  queues,  support  for  attributes  (synchronization),  efficient  storage 
management,  support  for  read/write  and  associative  segments,  and  fast  sequential 
execution. 

PiMac  is  fine-grained,  supporting  a  maximum  of  65  thousand  words  of  memory  and 
255  tasks  per  node.  It  provides  fast  microsequences  for  task  suspension  and  de¬ 
queuing  ('4  memory  cycles  for  each).  Segment  buffers  play  a  significant  role  in  the 
support  of  several  Pi  features  including  read-write  segments,  associative  segments, 
and  attributes.  PiMac  also  contains  an  autonomous  network  router  which  provides 
low  latency  communication  with  other  nodes. 

The  design  of  PiMac  has  been  significantly  influenced  by  the  Message  Driven  Pro¬ 
cessor.  A  gate-array  implementation  of  PiMac  is  considered  based  on  data  from  the 
MDP’s  design.  The  PiMac  core  is  expected  to  require  '30,000  gates.  Existing  tech¬ 
nology  can  easily  accommodate  this  design,  plus  a  modest-sized  segment  cache  (1000 
word)  to  further  reduce  memory  latency. 

Pi  performance  on  PiMac  is  evaluated.  Most  Pi  operations  have  direct  and  efficient 
realizations  on  PiMac.  An  unweighted  average  Pi  operation  requires  4.6  PiMac  in¬ 
structions  and  6.9  PiMac  machine  cycles.  Using  the  operation  frequencies  from  the 
model  mechanism  examples,  the  weighted  average  instruction  requires  2.5  PiMac  in¬ 
structions  and  4.4  PiMac  machine  cycles.  Based  on  this  weighted  average,  a  PiMac 
implementation  with  a  40  ns  machine  clock  executes  5.7  million  Pi  operations  per 
second. 


Why  Pi? 

This  thesis  introduces  Pi,  a  parallel  architecture  interface.  Pi  provides  several  benefits 
for  parallel  architectures  including: 

•  A  well-defined  boundary  separating  model  and  machine  issues 

•  A  platform  for  combining  multiple  model  mechanisms  to  address  the  require¬ 
ments  of  specific  applications 

•  An  evaluation  medium  for  architecture  components  with  explicit  parallel  oper¬ 
ation  costs 

•  An  efficient  vehicle  for  delivering  new  machine  techniques  and  technology  to 
existing  programming  environments 

A  key  feature  of  Pi  is  that  it  is  designed  specifically  to  be  an  architecture  interface. 
It  has  no  additional  purposes  that  could  compromise  its  effectiveness  as  \n  inter¬ 
face.  This  is  not  true  of  language-centered  interfaces,  machine- centered  interfaces. 
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and  theoretical  models,  which  have  other  goals.  Would  a  language  designer  remove 
a  model-biasing  feature  if  it  made  programming  easier?  Would  a  machine  designer 
neglect  a  model-biasing  mechanism  if  it  provided  a  signihcant  performance  improve¬ 
ment?  Using  Pi,  both  the  language  and  machine  designer  can  pursue  a  single  goal. 
The  interface  defines  a  context  for  their  separate  problems. 

By  focusing  on  basic  parallel  requirements,  Pi  provides  a  model  and  machine  inde¬ 
pendent  platform  with  which  to  build  parallel  systems.  This  lack  of  bias  improves 
Pi’s  generality  and  increases  its  chances  of  compatibility  with  future  models.  No  one 
knows  what  the  programming  model  of  the  future  will  be,  but  it  will  probably  share 
the  same  basic  parallel  requirements  of  existing  models:  communication,  synchro¬ 
nization,  locality,  linear  and  associative  storage,  and  sequential  sequences. 

In  the  introduction,  four  interface  evaluation  criteria  were  defined:  interface  abstract¬ 
ness,  presentation  accuracy,  functional  match,  and  implementation  bias.  In  the  course 
of  this  thesis,  these  criteria  have  guided  the  design  of  Pi.  As  model  mechanisms  were 
developed,  Pi  operations  that  restricted,  or  complicated  implementation  were  removed 
and  new  ones  were  added.  In  the  design  of  PiMac,  the  interface  was  examined  from 
the  other  side  and  adjustments  were  made. 

It  is  difficult  to  make  a  quantitative  assessment  of  these  criteria.  However,  through 
the  evaluation  and  refinement  in  this  thesis.  Pi  has  achieved  a  successful  balance 
between  models  and  machines. 

Future  Directions 

Pi  is  a  starting  point.  There  is  a  large  agenda  of  future  research  including  the  follow¬ 
ing. 

•  In  order  to  get  a  complete  and  accurate  representation  of  parallel  programs, 
programming  language  compilers  must  be  written  or  modified  to  produce  Pi 
programs.  Using  the  output  of  these  compilers,  a  larger  and  more  varied  set 
of  examples  can  be  collected.  This  also  requires  the  construction  of  a  complete 
set  of  model  mechanisms  to  support  each  programming  language.  Executing 
these  programs  will  provide  statistics  on  model  mechanism  usage  as  well  as  Pi 
operations. 

•  A  machine  dependent  compiler  and  operating  system  must  be  constructed  to 
provide  a  more  accurate  measure  of  Pi  operation  costs.  This  is  part  of  a  larger 
study  into  the  cost  of  supporting  generic  parallel  operations.  This  research 
must  also  address  several  open  research  areas  like  resource  management  in  a 
fine-grained  execution  environment.  New  modes  of  compiler  and  operating  sys¬ 
tem  interaction  are  required  (e.g.,  post-execution  recompilation  using  collected 
statistics). 

•  A  PiMac  prototype  must  be  constructed  to  accomplish  two  goals:  (a)  to  support 
execution  of  larger  and  more  realistic  programming  examples,  and  (b)  to  provide 


a  performance  incentive  for  people  outside  computer  architecture  who  have 
problems  they’d  like  to  solve  faster  and  big  problems  they  can’t  solve  at  all. 
The  latter  goal  is  necessary  if  parallel  computers  are  ever  to  become  more  than 
academic  curiosity. 

•  New  programming  environments  that  take  advantage  of  the  multi-model  ca¬ 
pability  of  Pi  must  be  developed.  These  environments  allow  problem  solving 
paradigms  to  be  selected  based  on  the  requirements  of  application,  rather  than 
machine  or  language  availability. 

Certainly  parallel  architecture  interfaces  will  continue  to  evolve  as  this  work  pro¬ 
gresses.  This  thesis  advocates  the  use  of  an  interface  based  on  generic  parallel  model 
requirements.  Pi  is  a  first  step  towards  that  goal. 
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Appendix  A 

Example  Source  Code 


This  appendix  contains  the  complete  code  for  the  examples  described  in  chapter  3. 


A.l  Common  Handlers 


Common 


; ;  Scott  Wills  14  September  1989 

I  I 

;;  These  are  common  handlers  for  the  Pi  examples. 

(in-package  ’user) 

: :  These  constants  define  offsets  to  the  Length,  Priority  and 
;;  Task-Type  fields  in  a  task  segment. 

(def ine-const2ait  LENGTH  0) 

(def ine-constant  PRIORITY  1) 

(def ine-constant  TASK-TYPE  2) 

(oaf ine-constant  LOCALS-OFFSET  3) 

: ;  This  handler  stores  a  value  in  a  read-»rite  segment  at  a  specified 
; ;  offset . 

(def ine-handler  Reply-Value  (Value  Segment  Offset)  () 

(write  (read  (self)  Segment)  (read  (self)  Offset)  (read  (self)  Value)) 
(destroy-segment  (self))) 

This  handler  adjusts  a  barrier  by  a  count. 

(def ine-handler  Adjust-Barrier  (Count-Change  Segment  Offset)  () 
(Adjust-Count  (read  (self)  Segment)  (read  (seif)  Offset) 

(read  (self)  Count-Change)) 

(destroy-segment  (self))) 

;;  This  handler  adds  a  value  to  a  location  sind  reduces  a  b-sync.  Both 
;;  the  add  location  and  b-sync  must  be  in  the  same  read-write  segment, 
(def ine-h2U»dler  Add-And-Reduce  (Add-Value  Segment  B-Sync-Of f set 

Add-Offset)  0 

(write  (read  (self)  Segment) 

(read  (self)  Add-Offset) 

(plus  (read  (read  (self)  Segment)  (read  (self)  Add-Offset)) 
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(read  (self)  Add-Value))) 

(Adjust-Count  (read  (self)  Segment)  (read  (self)  B-Sync-Off set)  -1) 
(destroy-segment  (self))) 


A. 2  Shared  Memory  With  Caches 


Shared  Memory 


Scott  Wills  9  October  1989 

This  is  shared  memory  model  example  uith  caches. 

There  eore  tno  protocols  in  this  example.  One  is  for  a  line  of  a 
chunk.  The  other  is  for  a  line  of  a  cache.  Both  protocols  are 
defined  belov. 

Chunk  Protocol 


Chunk  States:  UNLOCKED,  READ-ONLY,  READ-WRITE,  LOCKED-REMOVE , 
LOCKED-READ,  LOCKED-WRITE ,  LOCKED-FLUSH-READ ,  LOCKED-FLUSH-WRITE , 
LOCKED-INVALIDaTE-WRITE,  ERROR 

Chunk  Handlers;  Read,  Write,  Remove  Ack,  Ack-Flush 
UNLOCKED 

Read:  send  Reply;  goto  LOCKED-READ. 

Write:  send  Reply;  goto  LOCKED-WRITE. 

Remove:  goto  ERROR. 

Ack:  goto  ERROR. 

Ack-Flush:  goto  ERROR. 

READ-ONLY 

Read:  send  Reply;  add  to  chain;  goto  LOCKED-READ. 

Write:  send  Invalidate;  goto  LOCKED-INVALIDATE-WRITE. 

Remove;  send  Invalidate;  goto  LOCKED-REMOVE. 

Ack;  goto  ERROR. 

Ack-Flush:  goto  ERROR. 

READ-WRITE 

Read:  send  Flush;  goto  LOCKED-FLUSH-READ. 

Write:  send  Flush;  goto  LOCKED-FLUSH-WRITE. 

Remove;  send  Flush;  goto  LOCKED-REMOVE. 

Ack:  goto  ERROR. 

Ack-Flush:  goto  ERROR. 

LOCKED-REMOVE 

Read:  suspend  task;  goto  LOCKED-REMOVE. 
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Write:  suspend  task;  goto  LOG(ED-R£MOVE. 

Remove:  suspend  task;  goto  LOCKED -REMOVE. 

Ack:  requeue  suspended  tasks;  goto  UHLOCKEO. 

Ack-Flush:  update  line;  requeue  suspended  tasks;  goto  UVLOCKED. 
LOCKED-READ 

Read:  suspend  task;  goto  LOCKED-READ. 

Write:  suspend  task;  goto  LOCKED-READ. 

Remove:  suspend  task;  goto  LOCKED-READ. 

Ack:  requeue  suspended  tasks;  goto  READ-ONLY. 

Ack-Flush:  goto  ERROR. 

LOCKED-WRITE 

Read:  suspend  task;  goto  LOCKED-WRITE. 

Write:  suspend  task;  goto  LOCKED-WRITE. 

Remove:  suspend  task;  goto  LOCKED-WRITE. 

Ack:  requeue  suspended  tasks;  goto  READ-WRITE. 

Ack-Flush :  goto  ERROR . 

LOCKED-FLUSH-READ 

Read:  suspend  task;  goto  LOCKED-FLUSH-READ. 

Write:  suspend  task;  goto  LOCKED-FLUSH-READ. 

Remove:  suspend  task;  goto  LOCKED-FLUSH-READ. 

Ack:  goto  ERROR. 

Ack-Flush: 

LOCKED-FLUSH-WRITE 

Read:  suspend  task;  goto  LOCKED-FLUSH-WRITE. 

Write:  suspend  task;  goto  LOCKED-FLUSH-WRITE. 

Remove:  suspend  task;  goto  LOCKED-FLUSH-WRITE. 

Ack:  goto  ERROR. 

Ack-Flush:  update  line;  goto  LOCKED-WRITE. 
LOCKED-INVALIDATE-WRITE 

Read:  suspend  task;  goto  LOCKED-INVALIDATE-WRITE. 

Write:  suspend  task;  goto  LOCKED-INVALIDATE-WRITE. 

Remove:  suspend  task;  goto  LOCKED-INVALIDATE-WRITE. 

Ack:  goto  LOCKED-WRITE. 

Ack-Flush:  goto  ERROR. 

Cache  Protocol 


Cache  States:  INVALID,  READ-ONLY,  READ-WRITE,  LOCKED-READ, 
LOCKED-WRITE,  LOCKED-READ-ONLY-READ ,  LOCKED-READ-ONLY-WRITE , 
LOCKED-READ-WRITE-READ,  LOCKED-READ-WRITE-WRITE ,  ERROR 

Cache  Handlers:  Read-Same,  Write-Same,  Read-Different, 
Write-Different,  Reply,  Invalidate,  Flush 

INVALID 

Read-Same:  send  Read;  goto  LOCKED-READ. 

Write-Same:  send  Write;  goto  LOCKED-WRITE. 

Read-Different:  send  Read;  goto  LOCKED-READ. 
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Write-Dillerent:  send  Write;  goto  LOCKEO-WRITE . 

Reply:  goto  ERROR. 

Invalidate:  goto  ERROR. 

Flush:  goto  ERROR. 

READ-ONLY 

Read-Sane:  read  value;  goto  READ-ONLY. 

Write-Sane:  send  renove;  goto  LOCKED-READ-ONLY-WRITE . 

Read-Different:  send  renove;  goto  LOCXED-READ-ONLY-READ . 

Write-Diff erent:  send  remove;  goto  LOCKED-READ-ONLY-WRITE. 

Reply:  goto  ERROR. 

Invalidate:  send  Invalidate  or  Ack;  goto  INVALID. 

Flush:  goto  ERROR. 

READ-WRITE 

Read-Seune:  read  value;  goto  READ-WRITE. 

Write-Same:  write  value;  goto  READ-WRITE. 

Read-Different:  send  remove;  goto  LOCKED-READ-WRITE-READ . 
Write-Different:  send  remove;  goto  LOCKED-READ- WRITE-WRITE. 

Reply:  goto  ERROR. 

Invalidate:  goto  ERROR. 

Flush:  send  Ack-Flush;  goto  INVALID. 

LOCKED-READ 

Read-Same:  suspend  task;  goto  LOCKED-READ. 

Write-Same:  suspend  task;  goto  LOCKED-READ. 

Read-Different:  suspend  task;  goto  LOCKED-READ. 

Write-Different:  suspend  task;  goto  LOCKED-READ. 

Reply:  send  Ack;  update  line;  requeue  suspended  tasks;  goto  READ-ONLY. 
Invalidate:  goto  ERROR. 

Flush:  goto  ERROR. 

LOCKED-WRITE 

Read-Same:  suspend  task;  goto  LOCKED-WRITE. 

Write-S^une:  suspend  task;  goto  LOCKED-WRITE. 

Read-Different:  suspend  task;  goto  LOCKED-WRITE. 

Write-Different:  suspend  task;  goto  LOCKED-WRITE. 

Reply:  send  Ack;  update  cache;  requeue  suspended  tasks; 

goto  READ-WRITE. 

Invalidate:  goto  ERROR. 

Flush:  goto  ERROR. 

LOCKED-READ-ONLY-READ 

Read-Same:  read  value;  goto  LOCKED-READ-ONLY-READ. 

Write-Same:  suspend  task;  goto  LOCKED-READ-ONLY-READ. 

Read-Different:  suspend  task;  goto  LOCKED-READ-ONLY-READ. 
Write-Different:  suspend  task;  goto  LOCXED-READ-ONLY-READ. 

Reply:  goto  ERROR. 

Invalidate:  send  Ack;  change  name;  goto  LOCKED-READ. 

Flush:  goto  ERROR. 

LOCKED-READ-ONLY-WRITE 

Read-Same:  read  value;  goto  LOCKED-READ-ONLY-WRITE. 

Write-Same:  suspend  task;  goto  LOCKED-READ-ONLY-WRITE. 
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;;  Read-Diflerent ;  suspend  task;  goto  LOCKED-READ-ONLY-URITE. 

Write-Dillerent:  suspend  task;  goto  LOCKED-READ-OHLY-WRITE . 

; ;  Reply:  goto  ERROR. 

;;  Invalidate:  send  Ack;  change  name;  goto  LOCKED-WRITE . 

; ;  Flush:  goto  ERROR. 

>  > 

: :  LOCKED-READ-WRITE-READ 

; ;  Read-Same :  read  value ;  goto  LOCKED-READ-WRITE-READ . 

;;  Write-Same:  erite  value;  goto  LOCKED-READ-WRITE-READ. 

;;  Read-Different:  suspend  task;  goto  LOCKED-READ-WRITE-READ. 

;;  Write-Different:  suspend  task;  goto  LOCKED-READ-WRITE-READ. 

; ;  Reply:  goto  ERROR. 

;;  Invalidate:  goto  ERROR. 

;;  Flush:  send  Ack-Flush;  change  name;  goto  LOCKED-READ. 

I  * 

: ;  LOCKED-READ-WRITE-WRITE 

;;  Read-Same:  read  value;  goto  LOCKED-READ-WRITE-WRITE. 

: ;  Write-Same:  write  value;  goto  LOCKED-READ-WRITE-WRITE. 

:;  Read-Different:  suspend  task;  goto  LOCKED-READ-WRITE-WRITE. 

;;  Write-Different:  suspend  task;  goto  LOCKED-READ-WRITE-WRITE. 

;;  Reply:  goto  ERROR. 

;;  Invalidate:  goto  ERROR. 

; ;  Flush:  send  Ack-Flush;  change  name;  goto  LOCKED-WRITE. 

(in-package  ’user) 

; ;  This  constant  defines  the  number  of  nodes. 

(def ine-constant  Mumber-Of-Kodes  64) 

;;  This  constant  defines  the  number  of  lines  in  a  node  chunk. 

(def ine-constant  Chunk-Lines  8) 

;;  This  constant  defines  the  base  mask  (minus  the  line  offset). 

(def ine-constant  Base-Mask  4088)  ;;  (4096  -  8) 

; ;  This  constant  defines  the  offset  mask  (minus  the  line  base) . 

(def ine-constant  Offset-Mask  7) 

; ;  This  constant  defines  the  shift  to  move  the  node  to  the  bottom  of 
; ;  the  word. 

(def ine-constant  Mode-Shift  -6) 

; ;  This  constcint  defines  the  node  mask. 

(def ine-constant  Node-Mask  63) 

; ;  This  constant  defines  the  shift  to  move  the  index  to  the  bottom  of  the 
: :  word. 

(def ine-constant  Index-Shift  -3) 

; ;  This  constant  defines  the  index  mask. 

(def ine-constant  Index-Mask  7) 

; ;  This  constant  defines  the  size  of  the  cache  map. 

(def ine-constant  Cache-Map-Size  16) 

; ;  These  constants  specify  offsets  into  a  line. 

(def ine-constamt  Store-Offset  0) 

(def ine-constant  Status-Offset  8) 

(def ine-constant  Link-Offset  9) 

(def ine-constant  Request-Offset  10) 

(def ine-constant  Barrier-Offset  11) 

(def ine-constant  Line-Size  12) 

; ;  This  nodal  vectors  the  map  of  chunk  objects  on  a  node. 

(def ine-nodal  Chunk-Map) 

; ;  This  nodal  vectors  the  cache  map. 
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(deline-nodal  Cache-Map) 

;;  This  handler  initializes  shared  memory  in  the  entire  system, 
(deline-handler  Shared-Memory-Setup  ()  (Barrier  Index) 

(attribute  (sell)  Bao^rier  B-SYHC) 

(adjust-count  (sell)  Barrier  Humber-Ol-Nodes) 

(urite  (sell)  Index  Mumber-Ql -lodes) 

Loop 

(urite  (sell)  Index  (minus  (read  (sell)  Index)  1)) 

(send  (read  (sell)  Index)  9  HORMAL  Mode-Setup  (node-id)  (sell)  Barrier) 
(branch-not-zero  (read  (sell)  Index)  Loop) 

(test-count  (sell)  Barrier) 

(destroy- segment  (sell))) 

;;  This  handler  sets  up  a  node  to  support  a  shared  memory  model.  It 
: ;  allocates  and  initializes  the  shared  memory  chunk  and  cache 
; ;  storage. 

(deline-handler  Mode-Setup  (Ack-Mode  Ack-Segment  Ack-Ollset) 

(Index-A  Index-B  Line) 

(vrite  (nodals)  Chunk-Map  (create-read-write-segment  Chunk-Lines)) 

(write  (sell)  Index-A  Chunk-Lines) 

Loop-1 

(write  (sell)  Index-A  (minus  (read  (sell)  Index-A)  1)) 

(write  (sell)  Line  (create-read-write-segment  Line-Size)) 

(write  (sell)  Index-B  (plus  Store-Ollset  Line-Size)) 

Loop-2 

(write  (sell)  Index-B  (minus  (read  (sell)  Index-B)  1)) 

(write  (read  (sell)  Line)  (read  (sell)  Index-B)  0) 

(branch-not-zero  (compare  (read  (sell)  Index-B)  Store-Ollset)  Loop-2) 
(write  (read  (sell)  Line)  Status-Ollset  UMLOCXED) 

(write  (read  (sell)  Line)  Link-Ollset  EMD) 

(attribute  (read  (sell)  Line)  Barrier-Ollset  B-SYMC) 

(write  (read  (nodals)  Chunk-Map)  (read  (sell)  Index-A) 

(read  (sell)  Line)) 

(branch-not-zero  (read  (sell)  Index-A)  Loop-1) 

(write  (nodals)  Cache-Map 

(create-associative-segment  Cache-Map-Size  BOUMD-SAFE)) 

(send  (read  (sell)  Ack-Mode)  6  NORMAL  Adjust-Barrier 

-1  (read  (sell)  Ack-Segment)  (read  (sell)  Ack-Ollset)) 
(destroy-segment  (sell))) 

; ;  This  handler  executes  a  "read"  code  Iragment . 

(deline-handler  Read-Address  (Address)  (Base  Ollset  Line) 

(write  (sell)  Base  (and  (read  (sell)  Address)  Base-Mask)) 

(write  (sell)  Ollset  (and  (read  (sell)  Address)  Ollset-Mask) ) 

(write  (sell)  Line  (match  (read  (nodals)  Cache-Nap)  (read  (sell)  Base))) 
(branch-zero  (compare  (read  (sell)  Line)  UMBOUMD)  Get -Line) 

(branch-zero  (compare  (read  (read  (sell)  Line)  Status-Ollset)  INVALID) 
Get-Line) 

(return  (read  (read  (sell)  Line)  (read  (sell)  Ollset))) 

(destroy-segment  (sell)) 

Get-Line 

(write  (sell)  Line  (call  8  Get-Cache-Line  (read  (sell)  Base)  READ)) 
Read-Line 

(return  (read  (read  (sell)  Line)  (read  (sell)  Ollset))) 

(attribute  (read  (sell)  Line)  Request-Ollset  READ-WRITE) 

(destroy-segment  (sell))) 
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; ;  This  handler  executes  a  "vrite"  code  fragment . 

(deline-handler  Write-Address  (Address  Value)  (Base  Offset  Line) 

(write  (self)  Base  (and  (read  (self)  Address)  Base-Mask)) 

(write  (self)  Offset  (and  (read  (self)  Address)  Offset-Mask)) 

(write  (self)  Line  (match  (read  (nodals)  Cache-Map)  (read  (self)  Base))) 
(branch-zero  (compare  (read  (self)  Line)  UMBOUBD)  Get-Line) 
(branch-not-zero  (compare  (read  (read  (self)  Line)  Status-Offset) 
READ-WRITE)  Get-Line) 

(write  (read  (self)  Line)  (read  (self)  Offset)  (read  (self)  Value)) 
(destroy-segment  (self)) 

Get-Line 

(write  (self)  Line  (call  8  Get-Cache-Line  (read  (self)  Base)  WRITE)) 
(write  (read  (self)  Line)  (read  (self)  Offset)  (read  (self)  Value)) 
(attribute  (read  (self)  Line)  Request-Offset  READ-WRITE) 

(destroy-segment  (self))) 

;;  This  handler  loads  a  line  into  the  cache  in  the  specified  mode. 

;;  It  may  be  necessary  to  remove  an  existing  cache  line  first.  It 
; ;  may  also  be  necessary  to  create  an  empty  cache  line  if  no  cache 
; ;  exits . 

(def ine-handler  Get-Cache-Line  (Base  Mode)  (Current-Base  Node  Line) 

(write  (self)  Current-Base 

(insert  (read  (nodals)  Cache-Map)  (read  (self)  Base)  UNBOUND)) 
(write  (self)  Line 

(match  (read  (nodals)  Cache-Map)  (read  (self)  Current-Base))) 
(branch-not-zero  (compare  (read  (self)  Line)  UNBOUND)  Remove-Line) 

(write  (self)  Line  (create-read-write-segment  Line-Size)) 

(write  (read  (self)  Line)  Status-Offset  INVALID) 

(insert  (read  (nodals)  Cache-Map)  (read  (self)  Base)  (read  (self)  Line)) 
Remove-Line 

(write  (read  (self)  Line)  Request-Offset  (read  (self)  Base)) 

(attribute  (read  (self)  Line)  Request-Offset  READ-ONLY) 

(write  (self)  Current-Base 

(insert  (read  (nodals)  Cache-Map)  (read  (self)  Base)  UNBOUND)) 
(branch-zero  (compare  (read  (read  (self)  Line)  Status-Offset)  INVALID) 
Get-Line) 

(attribute  (read  (self)  Line)  Barrier-Offset  B-SYNC) 

(adjust-count  (read  (self)  Line)  Barrier-Offset  1) 

(write  (self)  Node  (a-shift  (read  (self)  Current-Base)  Node-Shift)) 
(write  (self)  Node  (and  (read  (self)  Node)  Node-Mask)) 

(send  (read  (self)  Node)  €  NORMAL  Remove  (read  (self)  Current-Base)) 
(test-count  (read  (self)  Line)  Barrier-Offset) 

Get-Line 

(remove  (read  (nodals)  Cache-Map)  (read  (self)  Current-Base)) 

(insert  (read  (nodals)  Cache-Map)  (read  (self)  Base)  (read  (self)  Line)) 
(attribute  (read  (self)  Line)  Barrier-Offset  B-SYNC) 

(adjust-count  (read  (self)  Line)  Barrier-Offset  1) 

(write  (self)  Node  (a-shift  (read  (self)  Base)  Node-Shift)) 

(write  (self)  Node  (and  (read  (self)  Node)  Node-Mask)) 

(send  (read  (self)  Node)  7  NORMAL  (read  (self)  Mode) 

(read  (self)  Base)  (node-id)) 

(test-count  (read  (self)  Line)  Barrier-Offset) 

(return  (read  (self)  Line)) 

(destroy-segment  (self))) 


; ;  Chunk  Handlers 

; ;  This  hstndler  reads  a  line  Iron  chunk  memory.  First  the  chunk  line 
; ;  status  is  tested,  and  the  lock  is  acquired.  If  the  status  is 
;;  READ-WRITE,  the  exclusive  copy  is  flushed.  Then  a  READ-ONLY  copy 
,;  is  sent  to  the  requesting  node,  and  it  is  pushed  on  the  READ-ONLY 
; ;  chain.  When  the  acknowledgment  is  received,  the  chunk  line  is 
; ;  unlocked. 

(def ine-handler  Read  (Base  Reply-Node)  (Index  Old-Status) 

(write  (self)  Index  (a-shift  (read  (self)  Base)  Index-Shift)) 

(write  (self)  Index  (and  (read  (self)  Index)  Index-Maisk)) 

(write  (self)  Index  (read  (read  (nodals)  Chunk-Map)  (read  (self)  Index))) 
(write  (self)  Old-Status  (read  (read  (self)  Index)  Status-Offset)) 
(attribute  (read  (self)  Index)  Status-Offset  D-SYNC) 

(branch-not-zero  (compare  (read  (self)  Old-Status)  READ-WRITE)  Skip-Flush) 
(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (read  (read  (self)  Index)  Link-Offset)  6  NORMAL  Flush 
(read  (self)  Base)) 

(test-count  (read  (self)  Index)  Barrier-Offset) 

Skip-Flush 

(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (read  (self)  Reply-Node)  16  NORMAL  Reply 

(read  (read  (self)  Index)  0)  (read  (road  (self)  Index)  1) 

(read  (read  (self)  Index)  2)  (read  (road  (self)  Index)  3) 

(road  (read  (self)  Index)  4)  (read  (read  (self)  Index)  5) 

(read  (read  (self)  Index)  6)  (read  (read  (self)  Index)  7) 

READ-ONLY  (read  (read  (self)  Index)  Link-Offset)  (read  (self)  Base)) 
(write  (road  (self)  Index)  Link-Offset  (road  (self)  Reply-Node)) 
(test-count  (road  (self)  Index)  Barrier-Offset) 

(write  (read  (self)  Index)  Status-Offset  READ-ONLY) 

(destroy-sogment  (self))) 

This  handler  replies  an  exclusive  READ-WRITE  copy  of  a  line.  First 
;;  the  chunk  line  status  is  tests,  and  the  lock  is  acquired.  If  the 

;;  status  is  READ-WRITE,  the  exclusive  copy  is  flushed.  If  the 

;;  status  is  READ-ONLY,  the  READ-ONLY  chain  is  invalidated.  Then  2ui 

;;  exclusive  READ-WRITE  copy  is  sent  to  the  requesting  node,  emd  the 

;;  link  field  is  set.  When  the  acknowledgment  is  received,  the 
; ;  chunk  line  is  unlocked. 

(def ine-handler  Write  (Base  Reply-Node)  (Index  Old-Status) 

(write  (self)  Index  (a-shift  (read  (self)  Base)  Index-Shift)) 

(write  (self)  Index  (and  (road  (self)  Index)  Index-Mask)) 

(write  (self)  Index  (read  (read  (nodsds)  Chunk-Map)  (road  (self)  Index))) 
(write  (self)  Old-Status  (read  (road  (self)  Index)  Status-Offset)) 
(attribute  (read  (self)  Index)  Status-Offset  D-SYNC) 

(branch-not-zero  (compare  (road  (self)  Old-Status)  READ-WRITE)  Skip-Flush) 
(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (road  (read  (self)  Index)  Link-Offset)  6  NORMAL  Flush 
(read  (self)  Base)) 

(test-count  (read  (self)  Index)  Barrier-Offset) 

Skip-Flush 

(branch-not-zero  (compare  (read  (self)  Old-Status)  READ-ONLY) 
Skip-Invalidate) 

(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (road  (read  (self)  Index)  Link-Offset)  6  NORMAL  Invalidate 
(read  (self)  Base)) 
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(test-count  (read  (self)  Index)  Barrier-Offset) 

Skip-Invalidate 

(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (read  (self)  Reply-Hode)  16  HORMAL  Reply 

(read  (read  (self)  Index)  0)  (read  (read  (self)  Index)  1) 

(read  (read  (self)  Index)  2)  (read  (read  (self)  Index)  3) 

(read  (read  (self)  Index)  4)  (read  (read  (self)  Index)  5) 

(read  (read  (self)  Index)  6)  (read  (read  (self)  Index)  7) 

READ-WRITE  EHD  (read  (self)  Base)) 

(write  (read  (self)  Index)  Link-Offset  (read  (self)  Reply-Hode)) 
(test-count  (read  (self)  Index)  Barrier-Offset) 

(write  (read  (self)  Index)  Status-Offset  READ-WRITE) 

(destroy-segment  (self))) 

;;  This  handler  removes  a  cache  copy  from  a  node.  If  the  line  status 
;;  is  UNLOCKED,  no  action  is  taken.  If  the  status  is  READ-ONLY,  all 
;;  read-only  copies  are  invalidated,  and  the  link  is  set  to  END.  If  the 
; ;  status  is  READ-WRITE,  the  exclusive  copy  is  flushed,  and  the  link  is  set 
: :  to  END . 

(def ine-handler  Remove  (Base)  (Index  Old-Status) 

(write  (self)  Index  (a-shift  (read  (self)  Base)  Index-Shift)) 

(write  (self)  Index  (and  (read  (self)  Index)  Index-Mask)) 

(write  (self)  Index  (read  (read  (nodals)  Chunk-Map)  (read  (self)  Index))) 
(write  (self)  Old-Status  (read  (read  (self)  Index)  Status-Offset)) 
(attribute  (read  (self)  Index)  Status-Offset  D-SYNC) 

(branch-not-zero  (compare  (read  (self)  Old-Status)  READ-WRITE)  Skip-Flush) 
(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (read  (read  (self)  Index)  Link-Offset)  6  NORMAL  Flush 
(read  (self)  Base)) 

(test-count  (read  (self)  Index)  Barrier-Offset) 

Skip-Flush 

(branch-not-zero  (compare  (read  (self)  Old-Status)  READ-ONLY) 

Sk ip- In val idat  e ) 

(adjust-count  (read  (self)  Index)  Barrier-Offset  1) 

(send  (read  (read  (self)  Index)  Link-Offset)  6  NORMAL  Invalidate 
(read  (self)  Base)) 

(test-count  (read  (self)  Index)  Barrier-Offset) 

Skip-Invalidate 

(write  (read  (self)  Index)  Link-Offset  EHD) 

(write  (read  (self)  Index)  Status-Offset  UNLOCKED) 

(destroy-segment  (self))) 

;;  This  handler  acknowledges  a  Reply  or  Invetlidate,  and  adjusts  the 
; ;  chunk  barrier. 

(def ine-h2uidler  Ack  (Base)  (Index) 

(write  (self)  Index  (a-shift  (read  (self)  Base)  Index-Shift)) 

(write  (self)  Index  (and  (read  (self)  Index)  Index-Mask)) 

(write  (self)  Index  (read  (read  (nodals)  Chunk-Map)  (read  (self)  Index))) 
(adjust-count  (read  (self)  Index)  Barrier-Offset  -1) 

(destroy-segment  (self))) 

; ;  This  handler  updates  the  chunk  data  from  a  flush  acknowledgment . 

; ;  It  then  adjusts  the  chunk  barrier. 

(def ine-handler  Ack-Flush  (Word-O  Word-1  Word-2  Word-3  Word-4  Word-5  Word -6 

Word-7  Base)  (Index) 

(write  (self)  Index  (a-shift  (read  (self)  Base)  Index-Shift)) 

(write  (self)  Index  (and  (read  (self)  Index)  Index-Mask)) 
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(write  (sell)  Index  (read  (read  (nodals)  Chunk-Map)  (read  (sell)  Index))) 
(write  (read  (sell)  Index)  0  (read  (sell)  Word-0)) 

(write  (read  (sell)  Index)  1  (read  (sell)  Word-1)) 

(write  (read  (sell)  Index)  2  (read  (sell)  Word-2)) 

(write  (read  (sell)  Index)  3  (read  (sell)  Word-3)) 

(write  (read  (sell)  Index)  4  (read  (sell)  Word-4)) 

(write  (read  (sell)  Index)  5  (read  (sell)  Word-5)) 

(write  (read  (sell)  Index)  6  (read  (sell)  Word-6)) 

(write  (read  (sell)  Index)  7  (read  (sell)  Word-7)) 

(write  (read  (sell)  Index)  Link-Ollset  EMD) 

(adjust-count  (read  (sell)  Index)  Barrier-Ollset  -1) 

(destroy-segment  (sell))) 


; ;  Cache  Handlers 

;;  This  handler  stores  a  line  (ten  words)  into  a  cache-line.  It  then 
; ;  adjusts  the  cache  line  barrier. 

(deline-handler  Reply  (Word-0  Word-1  Word-2  Word-3  Word-4  Word-5  Word-6 

Word-7  Status  Link  Base)  (Mode  Line) 

(write  (sell)  Mode  (a-shilt  (read  (sell)  Base)  Mode-Shilt)) 

(write  (sell)  Mode  (and  (read  (sell)  Mode)  Mode-Mask)) 

(send  (read  (sell)  Mode)  5  MORMAL  Ack  (read  (sell)  Base)) 

(write  (sell)  Line  (match  (read  (nodals)  Cache-Map)  (read  (sell)  Base))) 
(write  (read  (sell)  Line)  0  (read  (sell)  Word-0)) 

(write  (read  (sell)  Line)  1  (read  (sell)  Word-1)) 

(write  (read  (sell)  Line)  2  (read  (sell)  Word-2)) 

(write  (read  (sell)  Line)  3  (read  (sell)  Word-3)) 

(write  (read  (sell)  Line)  4  (read  (sell)  Word-4)) 

(write  (read  (sell)  Line)  5  (read  (sell)  Word-5)) 

(write  (read  (sell)  Line)  6  (read  (sell)  Word-6)) 

(write  (read  (sell)  Line)  7  (read  (sell)  Word-7)) 

(write  (read  (sell)  Line)  Status-Ollset  (read  (sell)  Status)) 

(write  (read  (sell)  Line)  Link-Ollset  (read  (sell)  Link)) 

(adjust-count  (read  (sell)  Line)  Barrier-Ollset  -1) 

(destroy-segment  (sell))) 

; ;  This  handler  Hushes  an  exclusive  read-write  cache  copy  and 
; ;  invalidates  the  entry.  It  then  adjusts  the  cache  line  barrier, 
(deline-handler  Flush  (Base)  (Mode  Line) 

(write  (sell)  Mode  (a-shilt  (read  (sell)  Base)  Mode-Shilt)) 

(write  (sell)  Mode  (and  (read  (sell)  Mode)  Mode-Mask)) 

(write  (sell)  Line  (match  (read  (nodals)  Cache-Map)  (read  (sell)  Base))) 
(write  (read  (sell)  Line)  Status-Ollset  IMVALID) 

(send  (read  (sell)  Mode)  13  MORMAL  Ack-Flush 

(read  (read  (sell)  Line)  0)  (read  (read  (sell)  Line)  1) 

(read  (read  (sell)  Line)  2)  (read  (read  (sell)  Line)  3) 

(read  (read  (sell)  Line)  4)  (read  (read  (sell)  Line)  5) 

(read  (read  (sell)  Line)  6)  (read  (re2td  (sell)  Line)  7) 

(read  (sell)  Base)) 

(adjust-count  (read  (sell)  Line)  Barrier-Ollset  -1) 

(destroy-segment  (sell))) 

;;  This  handler  invHidates  a  read-only  cache  copy  entry.  II  the 
;;  cache  link  is  EMD,  2uid  Ack  is  returned  to  the  chunk.  Otherwise, 

;;  the  invalidate  is  lorwarded  to  the  next  cache  copy  in  the  list. 

;;  The  cache  line  barrier  is  also  adjusted. 

(deline-handler  Invalidate  (Base)  (Mode  Line) 
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(vrite  (sell)  Line  (match  (read  (nodals)  Cache-Map)  (read  (sell)  Base))) 
(write  (read  (sell)  Line)  Status-Ollset  IIVALID) 

(adjust-count  (read  (sell)  Line)  Barrier-Ollset  -1) 

(br2mch-zero  (compare  (read  (read  (sell)  Line)  Link-Ollset)  END)  Ack-Chunk) 
(send  (read  (read  (.sell)  Line)  Link-Ollset)  6  NORMAL  Invalidate 
(read  (sell)  Base)) 

(destroy-segment  (sell)) 

Ack-Chunk 

(write  (sell)  Node  (a-shilt  (read  (sell)  Base)  Node-Shilt)) 

(write  (sell)  Node  (and  (read  (sell)  Node)  Node-Mask)) 

(send  (read  (sell)  Node)  ^  NORMAL  Ack  (read  (sell)  Base)) 

(destroy-segment  (sell))) 


A  3  Shared  Memory  With  Address  Braiding 


Shared  Memory 


; ;  Scott  Wills  14  September  1989 

I  • 

;;  This  is  shared  memory  mechanism  (without  caches)  writen  in  Pi. 

; ;  It  demonstrates  address  braiding. 

(in-package  ‘user) 

;;  This  constant  delines  the  number  ol  nodes. 

(deline-constant  Number-Ol-Nodes  64) 

; ;  This  is  the  node  variable  that  holds  the  pointer  to  the  shared  memory 
; ;  space . 

(deline-nodal  SM-Space) 

;;  This  handler  sets  up  a  node  to  support  a  shared  memory  model  (2,3). 

; ;  It  preallocates  all  ol  the  required  storage  in  a  read-write 
;;  segment.  The  segment  is  vectored  by  the  nodal:  SM-Space. 

; ;  This  handler  initializes  shared  memory  in  the  entire  system, 
(deline-handler  Shared-Memory-Setup  ()  (Barrier  Index) 

(attribute  (sell)  Bairrier  B-SYNC) 

(adjust-count  (sell)  Barrier  Number-Ol-Nodes ) 

(write  (sell)  Index  Number-Ol-Nodes) 

Loop 

(write  (sell)  Index  (minus  (read  (sell)  Index)  1)) 

(send  (read  (sell)  Index)  7  NORMAL  Node-Setup  (node-id)  (sell)  Barrier) 
(branch-not-zero  (read  (sell)  Index)  Loop) 

(test-count  (sell)  Barrier) 

(destroy-segment  (sell))) 

;;  This  handler  preallocates  1024  words  ol  storage  (64  16  word  blocks)  in  a 
; ;  segment  vectored  by  the  node  variable  SM-Space. 

(deline-handler  Node-Setup  (Ack-Node  Ack-Segment  Ack-Ollset)  (Segment) 
(write  (sell)  Segment  (create-read-write-segment  1024)) 

(write  (nodals)  SM-Space  (read  (sell)  Segment)) 

(send  (read  (sell)  Ack-Node)  6  NORMAL  Adjust-Barrier 

-1  (read  (sell)  Ack-Segment)  (read  (sell)  Ack-Ollset)) 
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(destroy-segment  (sell))) 

;;  This  handler  reads  a  value  and  replies  the  content. 

(deline-handler  Read  (Ollset  Reply-Hode  Reply-Segment  Reply-Ollset)  () 

(send  (read  (sell)  Reply-Hode)  6  NORMAL  Reply-Value 

(read  (.read  (nodals)  SM-Space)  (read  (sell)  Ollset)) 

(read  (sell)  Reply-Segment)  (read  (sell)  Reply-Ollset)) 
(destroy-segment  (sell))) 

;;  This  handler  writes  a  value.  An  acknowledgment  is  returned, 
(deline-handler  Write  (Ollset  Value  Ack-Iode  Ack-Segment  Ack-Ollset)  () 
(write  (read  (nodals)  SM-Space)  (read  (sell)  Ollset)  (read  (sell)  Value)) 
(send  (read  (sell)  Ack-Iode)  6  NORMAL  Adjust-Barrier 

-1  (read  (sell)  Ack-Segment)  (read  (sell)  Ack-Ollset)) 
(destroy-segment  (sell))) 

; ;  Shared  Memory  Two 

; ;  This  version  ol  shared  memory  is  the  most  basic.  It  uses  absolute 
;;  addressing  ol  preallocated  shared  memory  space. 

;;  This  handler  support  a  simple  shared  memory  model.  No  caches  yet! 

;;  An  address  is  broken  into  two  parts. 

I - node  (6) - 1 - ollset  (10) - 1 

: ;  The  address  space  is  composed  ol  up  to  64  chunks ,  each  containing 
; ;  1024  words . 

;;  This  handler  executes  a  "read"  code  Iragment. 

(deline-handler  Read-Address-2  (Address)  (Ollset  Node  Value) 

(write  (sell)  Ollset  (and  (read  (sell)  Address)  1023)) 

(write  (sell)  Node  (a-shilt  (read  (sell)  Address)  -10)) 

(write  (sell)  Node  (and  (read  (sell)  Node)  63)) 

(attribute  (sell)  Value  HRITE-ONCE) 

(send  (read  (sell)  Node)  7  NORMAL  Read  (read  (sell)  Ollset) 

(node-id)  (sell)  Value) 

(return  (read  (sell)  Value)) 

(destroy-segment  (sell))) 

;;  This  handler  executes  a  "write"  code  Iragment. 

(deline-handler  Hrite-Address-2  (Address  Value)  (Barrier  Ollset  Node) 

(write  (sell)  Ollset  (and  (read  (sell)  Address)  1023)) 

(write  (sell)  Node  (a-shilt  (read  (sell)  Address)  -10)) 

(write  (sell)  Node  (and  (read  (sell)  Node)  63)) 

(attribute  (sell)  Barrier  B-SYNC) 

(adjust-count  (sell)  Barrier  1) 

(send  (read  (sell)  Node)  8  NORMAL  Write  (read  (sell)  Ollset) 

(read  (sell)  Value)  (node-id)  (sell)  Barrier) 

(test-count  (sell)  Barrier) 

(destroy-segment  (sell))) 

; ;  Shared  Memory  Three 

;;  This  version  ol  shared  memory  is  similar  to  the  one  above,  but  it 
; ;  uses  a  braided  shared  memory  address . 

;;  This  handler  support  a  simple  shared  memory  model.  No  caches  yet! 

; ;  An  address  is  broken  into  three  parts : 

I - block  (6) - 1 - node  (6) - 1 - ollset  (4) - 1 
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;;  The  address  space  is  composed  of  up  to  64  blocks.  Each  block  is  spread 
;;  across  all  64  nodes.  Each  piece  of  a  block  is  16  uords  long. 

::  This  handler  executes  a  "read"  code  fragment. 

(def ine-handler  Read-Address  (Address)  (Offset  Mode  Value) 

(write  (self)  Value  (and  (read  (self)  Address)  15)) 

(write  (self)  Offset  (a-shift  (read  (self)  Address)  -6)) 

(write  (self)  Offset  (and  (read  (self)  Offset)  1008)) 

(write  (self)  Offset  (or  (read  (self)  Offset)  (read  (self)  Value))) 
(write  (self)  Mode  (a-shift  (read  (self)  Address)  -4)) 

(write  (self)  Mode  (and  (read  (self)  Mode)  63)) 

(attribute  (self)  Value  WRITE-OMCE) 

(send  (read  (self)  Mode)  7  MORMAL  Read  (read  (self)  Offset) 

(node-id)  (self)  Value) 

(return  (read  (self)  Value)) 

(destroy-segment  (self))) 

;;  This  handler  executes  a  "write"  code  fragment. 

(def ine-handler  Write-Address  (Address  Value)  (Barrier  Offset  Mode) 

(write  (self)  Mode  (and  (read  (self)  Address)  15)) 

(write  (self)  Offset  (a-shift  (read  (self)  Address)  -6)) 

(write  (self)  Offset  (and  (read  (self)  Offset)  1008)) 

(write  (self)  Offset  (or  (read  (self)  Offset)  (read  (self)  Mode))) 

(write  (self)  Mode  (a-shift  (read  (self)  Address)  -4)) 

(write  (self)  Mode  (and  (read  (self)  Mode)  63)) 

(attribute  (sel;  Barrier  B-SYMC) 

(adjust-count  (s«.if)  Barrier  l) 

(send  (read  (self)  Mode)  8  MORHAL  Write  (read  (self)  Offset) 

(lead  (self)  Value)  (node-id)  (self)  Barrier) 

(test-count  (seM)  Barrier) 

(destroy-segment  (self))) 


A. 4  Shared  Memory  Tests 


;;  Shared  Memory  Tests 

f  I 

; ;  Scott  Wills  9  October  1989 

*  t 

;;  This  file  contains  tests  for  the  shared  memory  mechanisms. 

(in-package  ’user) 

; ;  This  constant  is  a  multiple  of  the  cache  stride 

(define-constant  Stride  16) 

;;  This  handler  executes  several  tests  on  shared  memory. 

(def ine-handler  Test  ()  () 

(call  5  Shared-Memory-Setup) 

(print-user  ""Ashared  memory  initialized"*") 

(send  (another-node-id  AMY)  6  MORMAL  Test-1  1100) 

(send  (another-node-id  AMY)  6  MORMAL  Test-2  1200) 

(send  (another-node-id  AMY)  5  MORMAL  Test-3  1300) 
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(send  (another-node-id  ASY)  8  NORMAL  Test-4  1400) 

(send  (another-node-id  ANY)  6  NORMAL  Test-S  1500) 

(call  8  Test-6  0  256) 

(.send  (another-node-id  ANY)  8  NORMAL  Test-7  0  4096) 

(destroy-segment  (sell))) 

;;  This  handler  perloms  a  read  on  a  remote  node.  The  value  is  returned, 
(del ine-handler  Remote-Read  (Address  Reply-Node  Reply-Segment  Reply-Offset) 

(Value) 

(write  (self)  Value  (call  7  Read-Address  (read  (self)  Address))) 

(send  (read  (self)  Reply-Node)  6  NORMAL  Reply-Value  (read  (self)  Value) 
(read  (self)  Reply-Segment)  (read  (self)  Reply-Offset)) 
(destroy-segment  (self))) 

;;  This  handler  performs  a  write  on  a  remote  node.  The  barrier  is  reduced, 
(def ine-handler  Remote-Write  (Address  Vcilue  Ack-Node  Ack-Segment  Ack-Offset) 

0 

(call  8  Write-Address  (read  (self)  Address)  (read  (self)  V2due)) 

(send  (read  (self)  Ack-Node)  6  NORMAL  Adjust-Barrier 

-1  (read  (self)  Ack-Segment)  (read  (self)  Ack-Offset)) 
(destroy-segment  (self))) 

;;  This  handler  performs  a  test  on  a  node.  The  b<u:rier  is  reduced. 

(def ine-handler  Remote-Test  (Address  Value  Test 

Ack-Node  Ack-Segment  Ack-Offset)  () 

(call  7  Test-Location  (read  (self)  Address)  (read  (self)  Value) 

(read  (self)  Test)) 

(send  (read  (self)  Ack-Node)  6  NORMAL  Adjust-Barrier 

-1  (read  (self)  Ack-Segment)  (read  (self)  Ack-Offset)) 
(destroy-segment  (self))) 

;;  This  handler  tests  a  memory  location  for  a  specified  value. 

(def ine-handler  Test-Location  (Address  Value  Test)  (Test-Value) 

(write  (self)  Test-Value  (call  7  Read-Address  (read  (self)  Address))) 
(branch-zero  (compare  (read  (self)  Test-Value)  (read  (self)  Value)) 

Skip- Error) 

(print-user  ""AERROR:  Test  "d:  "d  <>  "d'k"  (read  (self)  Test) 

(read  (self)  Test-Value)  (read  (self)  Value)) 

Skip-Error 

(destroy-segment  (self))) 

;;  This  handler  prints  counts  on  hundred  boundaries,  and  new  lines  of 
; ;  thousand  boundaries . 

(def ine-hauidler  Print-Co\int  (Count)  () 

(br2mch-not-zero  (mod  (read  (self)  Count)  1000)  Skip-A) 

(print-user  ""A"  (read  (self)  Count)) 

Skip-A 

(branch-not-zero  (mod  (read  (self)  Count)  100)  Skip-B) 

(print -user  "■4d  "  (read  (self)  Count)) 

Skip-B 

(destroy-segment  (self))) 

;;  This  test  remotely  writes  a  location,  then  reads  it  locally,  then  writes 
; ;  it  locally. 

(def ine-handler  Test-1  (Address)  (Barrier  Result) 

(print-user  "'ATest  1:  read  a  remote  write,  invalidate  READ-ONLY  cache'A") 
(attribute  (self)  Barrier  B-SYNC) 

(Adjust-Count  (self)  Barrier  1) 

(send  (another-node-id  FAR)  8  NORMAL  Remote-Write 

(read  (self)  Address)  11111  (node-id)  (self)  Barrier) 
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(test-count  (sell)  Barrier) 

(call  7  Test-Location  (read  (self)  Address)  11111  1) 

(call  8  Write-Address  (read  (self)  Address)  111110) 

(c2dl  7  Test-Location  (read  (self)  Address)  111110  1) 

(destroy-segment  (self))) 

; ;  This  tests  vrites  a  location,  reads  a  location  that  results  in  a  cache 
;;  replacement,  then  reads  the  first  location  back. 

(def ine-handler  Test-2  (Address)  (Result  Barrier) 

(print-user  ""ATest  2:  same  cache  flush"*") 

(call  8  Write-Address  (read  (self)  Address)  22222) 

(write  (self)  Address  (plus  (read  (self)  Address)  Stride)) 

(attribute  (self)  Barrier  B-SYIC) 

(adjust-count  (self)  Barrier  1) 

(send  (another-node-id  AHY)  8  MORMAL  Remote-Write 

(read  (self)  Address)  20202  (node-id)  (self)  Barrier) 

(test-count  (self)  Barrier) 

(call  7  Read-Address  (read  (self)  Address)) 

(write  (self)  Address  (minus  (read  (self)  Address)  Stride)) 

(c2lI1  7  Test-Location  (read  (self)  Address)  22222  2) 

(destroy-segment  (self))) 

; ;  This  test  locally  writes  a  location,  then  reads  it  remotely  on 
;;  several  nodes. 

(def ine-handler  Test-3  (Address)  (Beo'rier) 

(print-user  ""ATest  3:  multiple  remote  reads  of  a  local  write"*") 

(call  8  Write-Address  (read  (self)  Address)  33333) 

(attribute  (self)  Bairrier  B-SYHC) 

(adjust-count  (self)  Barrier  3) 

(send  (another-node-id  AHY)  9  SORMAL  Remote-Test  (read  (self)  Address) 
33333  3  (node-id)  (self)  Barrier) 

(send  (another-node-id  ANY)  9  NORMAL  Remote-Test  (read  (self)  Address) 
33333  3  (node-id)  (self)  Barrier) 

(send  (another-node-id  ANY)  9  NORMAL  Remote-Test  (read  (self)  Address) 
33333  3  (node-id)  (self)  Barrier) 

(test-count  (self)  Barrier) 

(destroy-segment  (self))) 

; ;  This  test  locally  writes  a  location,  then  reads  it  remotely  on 
;;  several  nodes.  Then  it  writes  it  again  and  reads  it  back  to  meike  sure 
; ;  the  read  chain  was  invalidated. 

(def ine-handler  Test-4  (Address)  (Node-1  Node-2  Node-3  Barrier) 
(print-user  ""ATest  4;  read  chain  invalidation**") 

(write  (self)  Node-l  (another-node-id  FAR)) 

(write  (self)  Node-2  (another-node-id  FAR)) 

(write  (self)  Node-3  (aoiother-node-id  FAR)) 

(call  8  Write-Address  (read  (self)  Address)  40404) 

(attribute  (self)  Barrier  B-SYMC) 

(adjust-count  (self)  Barrier  3) 

(send  (read  (self)  Node-1)  9  NORMAL  Remote-Test  (read  (self)  Address) 
40404  4  (node-id)  (self)  Barrier) 

(send  (read  (self)  Node-2)  9  NORMAL  Remote-Test  (read  (self)  Address) 
40404  4  (node-id)  (self)  Barrier) 

(send  (read  (self)  Node-3)  9  NORMAL  Remote-Test  (read  (self)  Address) 
40404  4  (node-id)  (self)  Barrier) 

(test-count  (self)  Barrier) 

(call  8  Write-Address  (read  (self)  Address)  44444) 
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(adjust-coiint  (sell)  Barrier  3) 

(send  (read  (sell)  Iode-1)  9  MORMAL  Remote-Test  (read  (sell)  Address) 

44444  4  (node-id)  (sell)  Barrier) 

(send  (read  (sell)  Mode-2)  9  MORMAL  Remote-Test  (read  (sell)  Address) 

44444  4  (node-id)  (sell)  Barrier) 

(send  (read  (sell)  Rode-3)  9  MORMAL  Remote-Test  (read  (sell)  Address) 

44444  4  (node-id)  (sell)  Barrier) 

(test-count  (sell)  Barrier) 

(destroy-segment  (sell))) 

;;  This  test  locally  urites  three  cache  conllicting  locations,  then 
; ;  tests  there  values . 

(deline-hemdler  Test-5  (Base)  (Address  Barrier) 

(print -user  "'ATest  5:  multiple  srites  and  reads  on  node' A") 

(attribute  (sell)  Barrier  B-SYMC) 

(adjust-count  (sell)  Barrier  3) 

(write  (sell)  Address  (read  (sell)  Base)) 

(send  (node-id)  8  MORMAL  Remote-Write  (read  (sell)  Address)  555S1 
(node-id)  (sell)  Barrier) 

(write  (sell)  Address  (plus  (read  (sell)  Address)  Stride)) 

(send  (node-id)  8  MORMAL  Remote-Write  (read  (sell)  Address)  55552 
(node-id)  (sell)  Barrier) 

(write  (sell)  Address  (plus  (read  (sell)  Address)  Stride)) 

(send  (node-id)  8  MORMAL  Remote-Write  (read  (sell)  Address)  55553 
(node-id)  (sell)  Barrier) 

(test-count  (sell)  Barrier) 

(write  (self)  Address  (read  (sell)  Base)) 

(adjust-count  (sell)  Barrier  3) 

(send  (node-id)  9  NORMAL  Remote-Test  (read  (sell)  Address) 

55551  5  (node-id)  (sell)  Barrier) 

(write  (sell)  Address  (pfus  (read  (sell)  Address)  Stride)) 

(send  (node-id)  9  NORMAL  Remote-Test  (read  (sell)  Address) 

55552  5  (node-id)  (sell;  Barrier) 

(write  (sell)  Address  (plus  (read  (sell)  Address)  Stride)) 

(send  (node-id)  9  NORMAL  Remote-Test  (read  (sell)  Address) 

55553  5  (node-id)  (sell)  Barrier) 

(test-count  (sell)  Barrier) 

(destroy-segment  (sell))) 

; ;  This  location  remotely  writes  a  block  ol  memory  locations  steurting  at 
;;  Address,  then  remotely  tests  them. 

(deline-handler  Test-6  (Address  Size)  (Limit  Index  Barrier) 

(print-user  "'ATest  6:  remote  sequential  block  write  and  read'A") 

(attribute  (sell)  Barrier  B-SYMC) 

(adjust-count  (sell)  Barrier  (read  (sell)  Size)) 

(write  (sell)  Limit  (plus  (read  (sell)  Address)  (read  (sell)  Size))) 

(write  (sell)  Index  (read  (sell)  Address)) 

(print-user  "'Awriting  ") 

Loop-6A 

(send  (another-node-id  AMY)  8  MORMAL  Remote-Write  (read  (sell)  Index) 

(read  (sell)  Index)  (node-id)  (sell)  Beu:r'ier) 

(write  (sell)  Index  (plus  (read  (sell)  Index)  1)) 

(branch-not-zero  (compare  (read  (sell)  Index)  (read  (sell)  Limit))  Loop-6A) 
(test-count  (sell)  Barrier) 

(adjust-count  (sell)  Bsmrier  (read  (sell)  Size)) 

(write  (sell)  Index  (read  (sell)  Address)) 
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(print-user  "'^reading  •') 

Loop-6B 

(send  (another-node-id  ANY)  9  NORMAL  Remote-Test  (read  (sell)  Index) 

(read  (sell)  Index)  6  (node-id)  (sell)  Baurrier) 

(srite  (sell)  Index  (plus  (read  (sell)  Index)  1)) 

(branch-not-zero  (compare  Cread  (sell)  Index)  (read  (sell)  Limit))  Loop-6B) 
(test-count  (sell)  Barrier) 

(destroy-segment  (sell))) 

;;  Tbis  location  urites  a  block  ol  memory  locations  starting  at  Address, 

; ;  then  reads  them  back . 

(deline-h^mdler  Test-7  (Address  Size)  (Limit  Index  Value) 

(print-user  "'ATest  7:  sequential,  block  write  and  read'A") 

(write  (sell)  Limit  (plus  (read  (sell)  Address)  (read  (sell)  Size))) 

(write  (sell)  Index  (read  (sell)  Address)) 

(print-user  ""Awriting  **) 

Loop-7 A 

(call  4  Print-Count  (read  (sell)  Index)) 

(write  (sell)  Value  (times  (read  (sell)  Index)  10)) 

(call  8  Write-Address  (read  (sell)  Index)  (read  (sell)  Value)) 

(write  (sell)  Index  (plus  (read  (sell)  Index)  1)) 

(branch-not-zero  (compare  (read  (sell)  Index)  (read  (sell)  Limit))  Loop-7A) 
(write  (sell)  Index  (read  (sell)  Address)) 

(print -user  "‘Areading  ") 

Loop-7B 

(call  4  Print-Count  (read  (sell)  Index)) 

(write  (sell)  Value  (times  (read  (sell)  Index)  10)) 

(call  7  Test-Location  (read  (sell)  Index)  (read  (sell)  Value)  7) 

(write  (sell)  Index  (plus  (read  (sell)  Index)  1)) 

(branch-not-zero  (compare  (read  (sell)  Index)  (read  (sell)  Limit))  Loop-7B) 
(destroy-segment  (sell))) 


A. 5  Set  Synchronization 


;;  Set  Synchronization 

I  » 

; ;  Scott  Wills  23  January  1990 

»  t 

; ;  This  is  a  set  synchronization  mechanisms  demonstration, 
(in-package  ’user) 

(deline-constant  WORK-SIZE  50) 

(deline-constant  END  -1) 

(deline-constant  FANOUT  5) 

; ;  These  constant  deline  the  ollset  ol  the  Down-Barrier  in  a  stem  or 
; :  root . 

(deline-constant  DOWN-BARRIER-OFFSET  4) 

; ;  These  constant  deline  the  ollset  ol  the  Down-Segments  segment  in  a 
;  ;  stem  or  root . 

(deline-constant  DOWN-SEGMENTS-OFFSET  6) 
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; ;  This  constant  defines  the  offset  of  the  Up-Sync  for  stems  and 
; ;  leaves . 

(def ine-constant  UP-SYHC-OFFSET  3) 

; ;  This  handler  tests  set  synchronization  by  constructing  a 
;;  synchronization  test  of  a  specified  size,  and  performing  a 
;;  specified  niunber  of  iterations  using  it. 

(def ine-handler  Set-Sync-Test  (Humber  Iterations) 

(Index  Dovn-Barrier  Down-Mode  Down-Segments) 

(write  (self)  Down-lode  (another-node-id  HEIR)) 

(send  (read  (self)  Down-Hode)  14  lORMAL  Stem 
(read  (self)  lumber)  (node-id)  (self)  0) 

(attribute  (self)  Down-Beurrier  B-SYIC) 

(adjust-count  (self)  Down-Barrier  1) 

(write  (self)  Down-Segments  (create-read-write-segment  1)) 

(test-count  (self)  Down-Barrier)  ;;  wait  for  child  to  ack 
(write  (self)  Index  0) 

Main-Loop 

(send  (read  (self)  Down-Mode)  6  NORMAL  Reply-Value  (read  (self)  Index) 
(read  (read  (self)  Down-Segments)  0)  UP-SYMC-OFFSET) 

(adjust-count  (self)  Down-Barrier  1) 

(test-count  (self)  Down-Barrier)  ;;  wait  for  child  finish  signal 
(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-not-zero  (compare  (read  (self)  Index)  (read  (self)  Iterations)) 
Main-Loop) 

(send  (read  (self)  Down-Mode)  6  NORMAL  Reply-Value  END 

(read  (read  (self)  Down-Segments)  0)  UP-SYMC-OFFSET) 
(destroy-segment  (read  (self)  Down-Segments)) 

(destroy-segment  (self))) 

; ;  This  recursive  handler  creates  the  necessary  tree  structure  and 
;;  leaf  cells  for  the  specified  number  of  cells  and  fanout.  After 
; ;  the  necessary  children  are  created,  the  handler  waits  for  each  to 
;;  return  its  segment  name  via  the  ack  message.  When  all  children 
;;  have  acked,  the  handler  acks  its  parent  and  enters  the  main  loop. 

;;  This  loop  consists  of  the  following  procedure:  First  the  stem 
; ;  waits  for  the  start  signal  from  its  parent.  Then  it  forwards  the 
; ;  start  signal  to  each  of  its  children.  Then  it  tests  the  signal  to 
;;  see  if  it  is  the  END  signad  (end  of  simulation).  If  so,  it 
;;  destroys  itself.  Otherwise  it  waits  for  all  children  to  send 
;;  finish  signals.  Then  it  sends  a  finish  signal  to  its  parent  and 
; ;  returns  to  the  beginning  of  the  main  loop. 

(def ine-handler  Stem  (Number  Up-Mode  Up-Segment  Up-Index) 

(Up-Sync  Down-Barrier  Down-Modes  Down-Segments 
Index  Count  Temp) 

;; (print-user  "'ftcreating  stem  "a  parent  "a"*” 

;;  (self)  (read  (self)  Up-Segment)) 

(write  (self)  Down-Nodes  (create-read-write-segment  FANOUT)) 

(write  (self)  Index  0) 

Loop-A 

(write  (self)  Count  (minus  FANOUT  (read  (self)  Index))) 

(write  (self)  Temp  (ceiling  (read  (self)  Number)  (read  (self)  Count))) 
(branch-zero  (compare  (read  (self)  Temp)  1)  Skip-A) 

(brzmch-not -minus  (compare  (read  (self)  Temp)  FANOUT)  Skip-A) 

(write  (self)  Temp  FANOUT) 

Skip-A 
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(write  (read  (self)  Down-Hodes)  (read  (self)  Index)  (another-node-id  HEAR)) 
(branch-zero  (compare  (read  (self)  Temp)  1)  Skip-B) 

(send  (read  (read  (self)  Down-Modes)  (read  (self)  Index))  14  HORMAL  Stem 
(read  (self)  Temp)  (node-id)  (self)  (read  (self)  Index)) 

(branch-zero  0  Skip-C) 

Skip-B 

(send  (read  (read  (self)  Down-Modes)  (read  (self)  Index))  9  HORMAL  Leaf 
(node-id)  (self)  (read  (self)  Index)) 

Skip-C 

(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(write  (self)  Humber  (minus  (read  (self)  Humber)  (read  (self)  Temp))) 
(branch-not-zero  (read  (self)  Humber)  Loop-A) 

(write  (self)  Count  (read  (self)  Index)) 

(attribute  (self)  Down-Barrier  B-SYHC) 

(adjust-count  (self)  Down-Barrier  (read  (self)  Count)) 

(write  (self)  Down-Segments 

(create-read-write-segment  (read  (self)  Count))) 

(test-count  (self)  Down-Barrier)  wait  for  children  segment  acks 
(attribute  (self)  Up-Sync  S-SYHC) 

(send  (read  (self)  Up-Hode)  6  HORMAL  Ack 

(self)  (read  (self)  Up-Segment)  (read  (self)  Up-Index)) 

Main-Loop 

(write  (self)  Index  0) 

(write  (self)  Temp  (read  (self)  Up-Sync))  ;;  wait  for  stcirt  signal 
(print-user  ""i'a  starting'A”  (self)) 

Loop-B 

(send  (read  (read  (self)  Down-Modes)  (read  (self)  Index)) 

6  MQRKAL  Reply-Value  (read  (self)  Temp) 

(read  (read  (self)  Down-Segments)  (read  (self)  Index)) 

UP-SYHC-OFFSET) 

(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-not-zero  (compare  (read  (self)  Index)  (read  (self)  Count))  Loop-B) 
(branch-zero  (compare  (read  (self)  Temp)  EHD)  Simulation-End) 

(adjust-count  (self)  Down-Barrier  (read  (self)  Count)) 

(test-count  (self)  Down-Barrier)  ;;  wait  for  children  finish  signals 
;; (print-user  "'A*a  stopping"*”  (self)) 

(send  (read  (self)  Up-Hode)  6  HORMAL  Adjust-Barrier 

-1  (read  (self)  Up-Segment)  DOVffl-BARRIER-OFFSET) 

(branch-zero  0  Main-Loop) 

Simulation-End 

;; (print-user  ""*"a  ending"*"  (self)) 

(destroy-segment  (read  (self)  Down-Modes)) 

(destroy-segment  (read  (self)  Down-Segments)) 

(destroy-segment  (self))) 

;;  This  handler  creates  a  leaf  (work)  node.  It  first  acks  its 
;;  segment  name  to  it  parent.  Then  it  wait  for  a  start  signal.  If 
;;  the  signal  is  an  EHD  signal,  it  destroys  itself.  Otherwise  it 
;;  performs  one  iteration  of  the  work  and  sends  a  finish  signal. 

; ;  Then  it  awaits  another  start  signal. 

(def ine-handler  Leaf  (Up-Hode  Up-Segment  Up-Index)  (Up-Sync  Index  Temp) 
(print-user  ""*creating  leaf  "a  parent  "a"*"  (self) 

;;  (read  (self)  Up-Segment)) 

(attribute  (self)  Up-Sync  S-SYHC) 

(send  (read  (self)  Up-Hode)  6  HORMAL  Ack 
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(sail)  (read  (sell)  Up-Segment)  (read  (sell)  Up-Index)) 
(branch-zero  0  Wait-Point) 

Main-Loop 

(vrite  (sell)  Index  (minus  (read  (sell)  Index)  1)) 

(branch-not-zero  (read  (sell)  Index)  Main-Loop) 

;;  (print-user  ""fa  uorking'A"  (sell)) 

(send  (read  (sell)  Up-Kode)  6  lORNAL  Adjust-Barrier 

-1  (read  (sell)  Up-Segment)  OOWX-BARRIER-OFFSET) 

Wait-Point 

(write  (sell)  Index  WORK-SIZE) 

(write  (sell)  Temp  (read  (sell)  Up-Sync))  ;;  wait  lor  start  signal 
(branch-not-zero  (compare  (read  (sell)  Temp)  EID)  Main-Loop) 

;;  (print-user  ""fa  ending'A'*  (sell)) 

(destroy-segment  (sell))) 

; ;  This  handler  acknowledges  the  creation  ol  a  child  (leal  or  stem) 

; ;  and  writes  the  segment  at  the  appropriate  place  in  the 
; ;  Down-Segments  segment.  It  then  reduces  the  Down-Barrier  by  one. 
(deline-handler  Ack  (Segment  Up-Segment  Up-Index)  () 

(write  (read  (read  (sell)  Up-Segment)  DOWI-SEGMEITS-OFFSET) 

(read  (sell)  Up-Indox)  (read  (sell)  Segment)) 

(adjust-count  (read  (sell)  Up-Segment)  DOWN-BARRIER-OFFSET  -1) 
(destroy-segment  (sell))) 


A. 6  Object  Name  Translation 


.-anslation 

;  ;  Scott  Wills  26  Janueiry  1990 

«  » 

;;  This  is  an  object  name  translation  mechanism  writen  in  Pi. 
(in-package  ’user) 

; ;  This  constant  is  the  number  ol  maps . 

(deline-constant  lUMBER-OF-MAPS  10) 

; ;  This  constant  delines  the  initial  map  size 
(deline-constant  INITIAL-MAP-SIZE  26) 

; ;  This  constant  is  the  number  ol  agents . 

(deline-constant  NUMBER-OF-AGEITS  26) 

;;  These  constants  delines  ollsets  into  the  Agent  segment, 
(deline-constant  AGEMT-MAP-NODES  0) 

(deline-constant  AGENT-MAP-SEGMEITS  1) 

(deline-constant  AGENT-NEXT-NODE  2) 

(deline-constant  AGENT-NEXT-SEGMEIT  3) 

(deline-constant  AGENT-SIZE  4) 

;;  These  constants  delines  ollsets  into  the  client  segments, 
(deline-constant  CLIENT-AGENT-HODE  0) 

(deline-constant  CLIENT-AGENT-SEGMENT  1) 

(deline-constant  CLIENT-CACHE  2) 

(deline-constant  CLIENT-SIZE  3) 
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;;  This  constant  defines  the  initial  client  translation  cache  size, 
(define-constant  IHITIAL-CLIENT'CACHE-SIZE  10) 

; ;  This  handler  initializes  the  translation  system  (except  for  the 
; ;  clients  shich  must  be  created  separately  vhere  needed.  It  creates 
;;  the  maps  and  agents.  A  prototype  agent  location  is  returned  (node 
;;  and  segment). 

(def ine-handler  Initialize-Translations  (Reply-Hode  Reply-Segment 

Reply-Agent-Iode 
Reply- Agent-Segment ) 
(Agent-Hode  Agent -Segment ) 
(srite  (self)  Agent-Hode  (another-node-id  FAR)) 

(send  (read  (self)  Reply-Hode)  6  HORMAL  Reply-Value 

(read  (self)  Agent-Mode)  (read  (self)  Reply-Segment) 

(read  (self)  Reply-Agent-Iode)) 

(send  (read  (self)  Agent-Node)  10  lORNAL  Create-Agent  000 
(node-id)  (self)  Agent-Segment) 

(send  (read  (self)  Reply-Mode)  6  NORMAL  Reply-Value 
(read  (self)  Agent -Segment) 

(read  (self)  Reply-Segment)  (read  (self)  Reply-Agent-Segment)) 
(send  (another-node-id  FAR)  6  HORMAL  Create-Map 

0  (read  (self)  Agent-Node)  (read  (self)  Agent-Segment)) 
(destroy-segment  (self))) 

; ;  This  handler  creates  a  neu  map.  The  segment  name  is  replied  to  the 
;;  head  of  the  agent  ring.  If  the  index  is  less  than  NUMBER-OF-MAPS , 

; ;  an  additional  map  is  created. 

(def ine-handler  Create-Map  (Index  Reply-Node  Reply-Segment)  () 

(send  (read  (self)  Reply-Hode)  8  NORMAL  Map-Location-Update 
0  (node-id) 

(create-associative-segment  IIITIAL-MAP-SIZE  UNBOUND-SAFE) 

(read  (self)  Index)  (read  (self)  Reply-Segment)) 

(urite  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-zero  (compare  (read  (self)  Index)  NUMBER-OF-MAPS)  Skip-A) 
(send  (cinother-node-id  FAR)  6  NORMAL  Create-Map  (read  (self)  Index) 
(read  (self)  Reply-Mode)  (read  (self)  Reply-Segment)) 

Skip-A 

(destroy-segment  (self))) 

; ;  This  handler  creates  the  agent  ring.  This  procedure  is  a  bit 
;;  complicated.  When  the  ring  is  create,  it  is  created  in  the  reverse 
; ;  direction  of  normal  message  travel.  Each  new  agent  receives  the 
;;  node  and  segment  of  the  agent  in  front  of  it.  It  also  forwards  the 
; ;  return  field  including  the  node  and  segment  ID  of  the  start  of  the 
;;  ring  (to  close  the  ring  at  the  last  agent).  An  index  of  zero 
;;  indicates  the  starting  agent.  He  returns  his  segment  to  the  root 
; ;  and  passes  his  node  and  segment  ID  as  the  return  field  (for  the 
;;  ring  closing).  For  the  starting  agent  case,  the  input  parameters 
;;  are  used  abnormally.  Return-Node  contains  the  root  node  number. 

; ;  Return-Segment  contains  the  root  segment  number.  Return-Offset 
;;  contains  the  segment  offset  of  returned  value. 

(def ine-handler  Create-Agent  (Index  Next-Node  Next-Segment 

Return-Node  Return-Segment  Return-Offset) 
(Agent) 

(write  (self)  Agent  (create-read-write-segment  AGENT-SIZE)) 

(write  (read  (self)  Agent)  AGENT-MAP-NODES 

(create-read-write-segment  NUMBER-OF-MAPS) ) 
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(vritfl  (read  (sell)  Agent)  AGENT-NAP-SEGMEHTS 

(create-read-vrite-segment  lUNBER-OF-NAPS) ) 

(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-not-zero  (compare  (read  (self)  Index)  1)  Skip-A) 

(send  (read  (self)  Retum-Mode)  6  lORMAL  Reply-Value 
(read  (self)  Agent)  (read  (self)  Return-Segment) 

(read  (self)  Return-Offset)) 

(write  (self)  Retum-Iode  (node-id)) 

(write  (self)  Return-Segment  (read  (self)  Agent)) 

(branch-zero  0  Skip-B) 

Skip-A 

(write  (read  (self)  Agent)  AGEIT-IEXT-IODE  (read  (self)  Bext-lode)) 

(write  (read  (self)  Agent)  AGEHT-IEXT-SEGNEMT  (read  (self)  lext-Segment) ) 
Skip-B 

(branch-not-zero  (compare  (read  (self)  Index)  HUMBER-OF-AGEHTS)  Skip-C) 
(send  (read  (self)  Retum-Hode)  6  XORMAL  Reply-Value 

(node-id)  (read  (self)  Return-Segment)  AGEIT-MEXT-IODE) 

(send  (read  (self)  Retum-Hode)  6  HQRNAL  Reply-Value 

(read  (self)  Agent)  (read  (self)  Return-Segment)  AGEIT-HEXT-SEGMENT) 
(branch-zero  0  Skip-D) 

Skip-C 

(send  (another-node-id  FAR)  10  MORMAL  Create-Agent 

(read  (self)  Index)  (node-id)  (read  (self)  Agent) 

(read  (self)  Return-Node)  (read  (self)  Return-Segment)  0) 

Skip-D 

(destroy-segment  (self))) 

;;  This  handler  creates  a  new  client.  The  segment  name  is  replied  to  the 
;;  root  table.  It  then  requests  the  nearest  agent,  and  initializes  the 
;;  local  translation  cache. 

(define-handler  Create-Client  (Agent-Hode  Agent-Segment 

Reply-Rode  Reply-Segment  Reply-Index)  (Client) 
(write  (self)  Client  (create-read-write-segment  CLIENT-SIZE)) 

(send  (read  (self)  Reply-Node)  6  NORMAL  Reply-VeQue  (read  (self)  Client) 
(read  (self)  Reply-Segment)  (read  (self)  Reply-Index)) 

(send  (read  (self)  Agent-Node)  10  NORMAL  Closest-Agent 

0  (read  (self)  Agent-Segment)  (node-id)  (read  (self)  Client)  -1  0  0) 
(attribute  (read  (self)  Client)  CLIENT-AGEHT-HODE  D-SYIC) 

(attribute  (read  (self)  Client)  CLIENT-AGENT-SEGMENT  D-SYHC) 

(write  (read  (self)  Client)  CLIENT-CACHE 

(create-associative-segment  INITIAL-CLIENT-CACHE-SIZE  BOUND-UNSAFE) ) 
(destroy-segment  (self))) 

;;  This  handler  finds  the  closest  agent  to  a  client.  Each  agent 
;;  compares  its  distance  to  the  current  distance.  If  it  is  closer,  it 
; ;  updates  the  agent  node  and  segment  with  it  own  value  and  sets  the 
; ;  new  distance.  When  the  last  agent  makes  the  comparison,  the 
;;  closest  agent  is  returned  to  the  client. 

(define-handler  Closest-Agent  (Count  This-Agent  Client-Node  Client-Segment 

Distance  Agent-Node  Agent-Segment)  () 
(branch-zero  (compare  (read  (self)  Distance)  -1)  Skip-A) 

(branch-not-minus  (compare  (distance  (read  (self)  Client-Node)  (node-id)) 

(read  (self)  Distance))  Skip-B) 

Skip-A 

(write  (self)  Distance  (distance  (read  (self)  Client-Node)  (node-id))) 
(write  (self)  Agent-Node  (node-id)) 
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(«rit«  (sell)  Agent-Segment  (read  (sell)  This-Agent)) 

Skip-B 

(write  (sell)  Count  (plus  (read  (sell)  Count)  1)) 

(branch-zero  (compare  (read  (sell)  Count)  lUMBER-OF-AGEHTS)  Skip-C) 
(send  (read  (read  (sell)  This-Agent)  AGEHT-IEXT-IODE) 

10  lORMAL  Closest-Agent  (read  (sell)  Count) 

(read  (read  (sell)  This-Agent)  AGEIT-IEXT-SEGMEIT) 

(read  (sell)  Client-Node)  (read  (sell)  Client -Segment) 

(read  (sell)  Distance)  (read  (sell)  Agent -Node) 

(read  (sell)  Agent-Segment)) 

(branch-zero  0  Skip-D) 

Skip-C 

(send  (read  (sell)  Client-lode)  6  NORMAL  Reply-Veilue 

(read  (sell)  Agent-lode)  (read  (sell)  Client -Segment) 

CLIENT- AGENT-NODE ) 

(send  (read  (sell)  Client-Node)  6  NORMAL  Reply-Value 

(read  (sell)  Agent-Segment)  (read  (sell)  Client-Segment) 
CLIENT-AGENT-SEGMENT) 

Skip-D 

(destroy-segment  (sell))) 

;;  This  handler  updates  a  map  location  maintained  by  each  agent.  The 
; ;  map  number  is  indicated  by  Map-Index.  II  the  count  is  not  equal  to 
; ;  NUNBER-OF-AGENTS ,  the  message  is  also  lorwarded  to  the  next  agent 
; ;  (with  the  count  incremented) . 

(deline-handler  Map-Location-Update  (Count  Map-Node  Map-Segment  Map-Index 

Agent-Segment )  ( ) 

(write  (read  (read  (sell)  Agent-Segment)  AGENT-MAP-NODES) 

(read  (sell)  Map-Index)  (read  (sell)  Map-Node)) 

(write  (read  (read  (sell)  Agent-Segment)  AGENT-MAP-SEGMEHTS ) 

(read  (sell)  Map-Index)  (read  (sell)  Map-Segment)) 

(write  (sell)  Count  (plus  (read  (sell)  Count)  1)) 

(branch-zero  (compare  (read  (sell)  Count)  NUMBER-OF- AGENTS)  Skip-A) 
(send  (read  (read  (sell)  Agent -Segment )  AGENT-NEXT-NODE) 

8  NORMAL  Nap- Location-Update 

(read  (sell)  Count)  (read  (sell)  Map-Node) 

(read  (sell)  Nap-Segment)  (re2ul  (sell)  Nap-Index) 

(read  (read  (sell)  Agent-Segment)  AGENT-NEXT-SEGMENT)) 

Skip-A 

(destroy-segment  (sell))) 

;;  This  handler  inserts  a  translation  on  the  specilied  client.  It 
;;  must  be  invoked  on  the  same  node  as  the  client.  The  h2mdler  waits 
; ;  lor  an  acknowledgment  belore  ending.  II  this  handler  is  called, 

;;  the  insertion  is  synchronous.  II  is  sent  (on  the  same  node),  the 
;;  insertion  is  asynchronous. 

(deline-handler  Insert  (Name  Value  Client-Segment)  (Barrier) 

(attribute  (sell)  Barrier  B-SYIC) 

(adjust-count  (sell)  Barrier  l) 

(send  (read  (read  (sell)  Client-Segment)  CLIENT-AGENT-NODE) 

10  lORMAL  Agent-Insert 

(read  (sell)  Name)  (read  (sell)  Value) 

(read  (read  (sell)  Client-Segment)  CLIENT-AGENT-SEGMENT) 

(node-id)  (sell)  Barrier) 

(test-count  (sell)  Barrier) 

(destroy-segment  (sell))) 
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;;  This  called  handler  matches  a  translation  on  the  specified  client. 

; ;  It  must  be  invoked  on  the  same  node  as  the  client .  The  handler 
;;  first  checks  for  a  match  in  its  local  agent  cache.  Failing  there, 

; ;  it  sends  a  match  request  to  the  agent.  The  value  is  returned  to 
;;  the  caller. 

(define-handler  Match  (Hama  Client -Segment)  (Value) 

(vrite  (self)  Value  (match  (read  (read  (self)  Client -Segment)  CLIEHT-CACHE) 

(read  (self)  lama))) 

(branch-not-zero  (compare  (read  (self)  Value)  UHBOUHD)  Skip-A) 

(attribute  (self)  Value  D-SYIC) 

(send  (read  (read  (self)  Client-Segment)  CLIEIT-AGEHT-HOOE) 

9  lORMAL  Agent-Match 
(read  (self)  Marne) 

(read  (read  (self)  Client-Segment)  CLIEHT-AGEHT-SEGMEIT) 

(node-id)  (self)  Value) 

(insert  (read  (read  (self)  Client-Segment)  CLIEHT-CACHE) 

(read  (self)  Hame)  (read  (self)  Value)) 

Skip-A 

(return  (read  (self)  Value)) 

(destroy-segment  (self))) 

; ;  This  called  handler  matches  a  translation  uhen  an  eaorlier 
;;  translation  produces  an  incorrect  result.  It  first  invalidates  the 
;;  translation  in  the  client  cache  (if  one  exists).  Then  it  requests 
; ;  the  translation  using  the  Match  handler. 

(define-handler  Rematch  (Hame  Client -Segment)  () 

(remove  (read  (read  (self)  Client-Segment)  CLIEHT-CACHE) 

(read  (self)  Hame)) 

(return  (call  6  Match  (read  (self)  Hame)  (read  (self)  Client-Segment))) 
(destroy-segment  (self))) 

; ;  This  handler  inserts  a  translation  into  the  distributed  translation 
; ;  map.  An  ack  is  returned. 

(define-handler  Agent-Insert  (Hame  Value  This-Agent 

Ack-Hode  Ack-Segment  Ack-Offset)  (Index) 

(write  (self)  Index  (mod  (read  (self)  Name)  HUMBER-OF-MAPS) ) 

(send  (read  (read  (read  (self)  This-Agent)  AGEHT-MAP-NODES) 

(read  (self)  Index)) 

9  HORMAL  Insert-In-Map 

(read  (read  (read  (self)  This-Agent)  AGEHT-MAP-SEGMEHTS) 

(read  (self)  Index)) 

(read  (self)  Hame)  (read  (self)  Value)  (read  (self)  Ack-Hode) 

(read  (self)  Ack-Segment)  (read  (self)  Ack-Offset)) 

(destroy-segment  (self))) 

; ;  This  handler  matches  a  translation  into  the  distributed  translation 
; ;  map.  The  resulting  value  is  returned. 

(define-handler  Agent-Match  (Hame  This-Agent 

Reply-Mode  Reply-Segment  Reply-Offset)  (Index) 
(write  (self)  Index  (mod  (read  (self)  Marne)  NUNBER-OF-MAPS) ) 

(send  (read  (read  (read  (self)  This-Agent)  AGEHT-MAP-HODES) 

(read  (self)  Index)) 

8  HORMAL  Hatch-In-Map 

(read  (read  (read  (self)  This-Agent)  AGEHT-MAP-SEGMEHTS) 

(read  (self)  Index)) 

(read  (self)  Hame)  (read  (self)  Reply-Mode) 

(read  (self)  Reply-Segment)  (read  (self)  Reply-Offset)) 
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(destroy-segment  (sell))) 

; ;  This  handler  inserts  a  translation  to  the  specified  Dap.  An 
; ;  acknowledgment  is  returned . 

(def ine-handler  Insert-ln-Map  (This-Map  Name  Value 

Ack-lode  Ack-Segment  Ack-Offset)  () 
(insert  (read  (self)  This-Map)  (read  (self)  lame)  (read  (self)  Value)) 
(send  (read  (self)  Ack-Hode)  6  MORMAL  Adjust-Barrier 

-1  (read  (self)  Ack-Segment)  (read  (self)  Ack-Offset)) 
(destroy-segment  (self))) 

This  handler  matches  a  translation  in  the  specified  map.  The  value 
; ;  is  returned. 

(def ine-handler  Natch- In-Map  (This-Map  Mame 

Reply-Mode  Reply-Segment  Reply-Offset)  () 
(send  (read  (self)  Reply-Mode)  6  MORMAL  Reply-Value 

(match  (read  (self)  This-Map)  (read  (self)  Mame)) 

(read  (self)  Reply-Segment)  (read  (self)  Reply-Offset)) 
(destroy-segment  (self))) 


A. 7  Object  Name  Translation  Tests 


;;  Translation  Tests 

I  I 

; ;  Scott  Wills  1  February  1990 

>  I 

; ;  This  file  contains  tests  for  the  translation  mechamism. 
(in-package  'user) 

;;  This  constant  is  the  number  of  clients. 

(define-constant  MUMBER-OF-CLIEMTS  25) 

; ;  This  handler  begins  by  initializing  the  translation  network.  It 
;;  then  creates  the  clients.  It  then  executes  several  tests  on  shared 
; ;  memory.  All  tests  requiring  client  locations  must  be  executed  on 
; ;  the  same  node  as  this  handler. 

(def ine-handler  Test  ()  (Agent-Mode  Agent-Segment 

Client-Modes  Client-Segments  Index) 
(print-user  ""^Translation  Tests"*'*) 

(send  (another-node-id  AMY)  9  MORMAL  Initialize-Translations 
(node-id)  (self)  Agent-Mode  Agent-Segment) 

(write  (self)  Client-Modes 

(create-read-write-segment  MUNBER-OF-CLIEMTS) ) 

(write  (self)  Client-Segments 

(create-read-write-segment  MUMBER-OF-CLIEMTS) ) 

(write  (self)  Index  0) 

Loop-A 

(write  (read  (self)  Client-Nodes)  (read  (self)  Index) 
(another-node-id  AMY)) 

(send  (read  (read  (self)  Client-Modes)  (read  (self)  Index)) 

9  NORMAL  Create-Client 

(read  (self)  Agent-Node)  (read  (self)  Agent-Segment) 
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(noda-id)  (raad  (sail)  Cliant-Sagnants)  (raad  (sail)  Indaz)) 

(vrita  (sail)  Indaz  (plus  (raad  (sail)  Indaz)  1)) 

(branch-not-zaro  (compara  (raad  (sell)  Indaz)  HUMBER-OF-CLIEHTS)  Loop-A) 
(call  9  Test-1  100  (read  (sell)  Client-Vodes) 

(read  (sell)  Client-Sagsiants)) 

(call  9  Test-2  5  10  (read  (sail)  Cliant-Hodas ) 

(raad  (sail)  Cliant-Sagnants)) 

(dastroy-segnent  (sell))) 

;;  This  test  inserts  a  spacilied  number  ol  translations,  units  lor  all 
; ;  to  be  acknovladged,  than  reads  than  all  back  and  tests  them.  It 
; ;  then  rereads  all  translations  (on  the  same  clients)  to  test  the 
; ;  client  caches  (check  the  number  ol  inserts  and  matches  in  the 
;;  prolile). 

(deline-handler  Test-1  (Humber  Client-Hodas  Client-Segments) 

(Indaz  Client  Barrier) 

(print-user  "'kTest  1:  sequential  insert  and  match  and  cache  test'k") 
(attribute  (sell)  Barrier  B-SYHC) 

(adjust-count  (sell)  Barrier  (read  (sell)  lumber)) 

(urite  (sell)  Indaz  0) 

Loop-A 

(urite  (sell)  Client  (mod  (read  (sell)  Indez)  HUNBER-OF-CLIENTS) ) 

(sand  (read  (read  (sell)  Client-Hodes)  (read  (sell)  Client)) 

9  HORNAL  Client-Insert 

(read  (sell)  Indez)  (times  (read  (sell)  Indez)  10) 

(read  (read  (sell)  Client-Segments)  (read  (sell)  Client)) 

(node-id)  (sell)  Barrier) 

(write  (sell)  Indez  (plus  (read  (sell)  Indez)  1)) 

(branch-not-zero  (compare  (read  (sell)  Indez)  (read  (sell)  Humber))  Loop-A) 
(test-count  (sell)  Barrier) 

(adjust-count  (sell)  Barrier  (read  (sell)  Humber)) 

(write  (sell)  Indez  0) 

(send  (node-id)  8  HORMAL  Compare-All-Clients  (read  (sell)  Humber) 

(read  (sell)  Client-Hodes)  (read  (sell)  Client-Segments)) 

(send  (node-id)  8  HORMAL  Compare-All-Clients  (read  (sell)  Humber) 

(read  (sell)  Client-Hodes)  (read  (sell)  Client -Segments)) 
(destroy-segment  (sell))) 

;;  This  tests  creates  a  client  cache  inconsistency,  then  corrects  it. 

; ;  The  lollowing  steps  are  perlormed: 

;;  (1)  the  translation  (1000,  12345)  is  inserted  on  Client-B 
; ;  (2)  the  translation  lor  1000  is  compared  on  Client-A  (caching  it) 

;;  (3)  the  translation  (1000,  54321)  is  inserted  on  Client-B  (changing  it) 

; ;  (4)  the  translation  on  Client-A  is  tested  lor  the  old  value  (12345) 

; ;  (5)  the  translation  on  Client-A  is  retested  lor  the  new  value  (54321) 
(deline-handler  Test-2  (Client-A  Client-B  Client-Hodes  Client-Segments) 

(Value  Barrier) 

(print-user  "'kTest  2:  client  cache  consistency  test'k") 

(attribute  (sell)  Barrier  B-SYHC) 

(adjust-count  (sell)  Barrier  1) 

(send  (read  (read  (sell)  Client-Hodes)  (read  (sell)  Client-B)) 

9  HORNAL  Client-Insert  1000  12345 

(read  (read  (sell)  Client-Segments)  (read  (sell)  Client-B)) 

(node-id)  (sell)  Barrier) 

(test-count  (sell)  Barrier) 
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(send  (read  (read  (self)  Client-Modes)  (read  (self)  Client-A)) 

9  NORMAL  Client-Match  1000 

(read  (read  (self)  Client-Segments)  (read  (self)  Client-A)) 

(node-id)  (self)  Value) 

(branch-zero  (compare  (read  (self)  Value)  12345)  Skip-A) 

(print-user  "'AERROR:  *a  translation  value  'a  does  not  match  expected  'a~A" 
1000  (read  (self)  Value)  12346) 

Skip-A 

(attribute  (self)  Value  D-SYSC) 

(adjust-count  (self)  Barrier  1) 

(send  (read  (read  (self)  Client-Modes)  (read  (self)  Client-B)) 

9  NORMAL  Client-Insert  1000  5432 1 

(read  (read  (self)  Client-Segments)  (read  (self)  Client-B)) 

(node-id)  (self)  Barrier) 

(test-count  (self)  Barrier) 

(send  (read  (read  (self)  Client-Modes)  (read  (self)  Client-A)) 

9  NORMAL  Client-Match  1000 

(read  (read  (self)  Client-Segments)  (read  (self)  Client-A)) 

(node- id)  (self)  Value) 

(branch-zero  (compare  (read  (self)  Value)  12345)  Skip-B) 

(print-user  "■ftERROR;  "a  tramslation  value  *a  does  not  match  expected  "a*A" 
1000  (read  (self)  Value)  12345) 

Skip-B 

(attribute  (self)  Value  D-SYMC) 

(send  (read  (read  (self)  Client-Modes)  (read  (self)  Client-A)) 

9  NORMAL  Client-Rematch  1000 

(read  (read  (self)  Client-Segments)  (read  (self)  Client-A)) 

(node-id)  (self)  Value) 

(branch-zero  (compare  (read  (self)  Value)  54321)  Skip-C) 

(print-user  "'AERROR:  "a  translation  value  *a  does  not  match  expected  *a"A" 
1000  (read  (self)  Value)  54321) 

Skip-C 

(destroy-segment  (self))) 

; ;  This  handler  tests  a  specified  number  of  translations  on  the 
;;  clients.  This  handler  must  be  executed  on  the  same  node  as  the 
;;  client  node  and  segment  arrays. 

(def ine-handler  Compare-All-Clients  (Number  Client-Nodes  Client-Segments) 

(Index  Client) 

(srite  (self)  Index  0) 

Loop-A 

(write  (self)  Client  (mod  (read  (self)  Index)  NUHBER-OF-CLIENTS) ) 

(send  (read  (read  (self)  Client-Nodes)  (read  (self)  Client)) 

7  NORMAL  Client-Compare 

(read  (self)  Index)  (times  (read  (self)  Index)  10) 

(read  (read  (self)  Client-Segments)  (read  (self)  Client))) 

(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-not-zero  (compare  (read  (self)  Index)  (read  (self)  Number))  Loop-A) 
(destroy-segment  (self))) 

;;  This  handler  inserts  a  translation  on  the  specified  client,  and 
; ;  waits  for  the  insertion  to  complete.  This  handler  must  be  executed 
;;  on  the  same  node  as  the  client.-  An  acknowledgment  is  returned 
;;  when  the  insertion  completes. 

(def ine-handler  Client-Insert  (Name  Value  Client-Segment 

Ack-Node  Ack-Segment  Ack-Offset)  () 
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(call  7  Insert  (read  (self)  Hame)  (read  (self)  Value) 

(read  (self)  Client-Segment)) 

(send  (read  (self)  Ack-Node)  6  lORMAL  Adjust-Barrier 

-1  (read  (self)  Ack-Segment)  (read  (self)  Ack-Offset)) 
(destroy-segment  (self))) 

; ;  This  handler  matches  a  translation.  It  requests  the  match  from  the 
;;  client.  If  the  translated  value  does  not  match  the  specified 
;;  value,  an  error  is  printed.  This  test  must  be  run  on  the  same  node 
; ;  as  the  client. 

(def ine-handler  Client-Compare  (Rame  Compare-Value  Client-Segment)  (Value) 
(nrite  (self)  Value 

(call  6  Match  (read  (self)  lame)  (read  (self)  Client-Segment))) 
(branch-zero  (compare  (read  (self)  Value)  (read  (self)  Compare -Value)) 

Skip) 

(print-user  "'ftERROR:  ~a  translation  value  *a  does  not  match  expected  *a*t" 
(read  (self)  Rame)  (read  (self)  Value) 

(read  (self)  Compare-Value)) 

Skip 

(destroy-segment  (self))) 

; ;  This  handler  matches  a  translation.  It  returns  the  matched  value 
; ;  to  the  caller.  This  handler  must  must  be  executed  on  the  same  node 
;;  as  the  client. 

(def ine-handler  Client-Hatch  (Rame  Client-Segment 

Reply-Rode  Reply-Segment  Reply-Offset)  (Value) 

(nrite  (self)  Value 

(call  6  Match  (read  (self)  Rame)  (read  (self)  Client-Segment))) 

(send  (read  (self)  Reply-Rode)  6  RORMAL  Reply-Value  (read  (self)  Value) 
(read  (self)  Reply-Segment)  (read  (self)  Reply-Offset)) 
(destroy-segment  (self))) 

;;  This  handler  is  simileur  to  Client-Match,  except  Rematch  is  invoked. 

;;  This  handler  must  must  be  executed  on  the  same  node  as  the  client. 

(def ine-handler  Client-Rematch  (Rame  Client-Segment  Reply-Mode 

Reply-Segment  Reply-Offset)  (Value) 

(write  (self)  Value 

(call  5  Rematch  (read  (self)  Rame)  (read  (self)  Client-Segment))) 
(send  (read  (self)  Reply-Rode)  6  RORMAL  Reply-Value  (read  (self)  Value) 
(read  (self)  Reply-Segment)  (read  (self)  Reply-Offset)) 
(destroy-segment  (self))) 


A. 8  Non-Resident  Handlers 


;;  Ron-Resident  Handler 

*  > 

: :  Scott  Wills  1  February  1990 

I  > 

;;  This  example  demonstrates  non-resident  handler  support  in  Pi. 
(in-package  ’user) 

;;  These  constants  define  offsets  into  a  handler  segment. 
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(deline-constamt  HAKDLER-LEIIGTH  0) 

(def ine-constant  HAVDLER-NAME  1) 

;;  This  constant  defines  the  number  of  nodes. 

(define-constant  HUNBER-QF-NODES  64) 

;;  This  constant  defines  the  reference  and  current  handler  map  size. 

; ;  The  reference  handler  map  can  gros  larger;  the  current  handler  map 
; ;  size  is  fixed. 

(define-constant  HAVDLER-NAP-SIZE  5) 

;;  This  nodal  maintains  the  node’s  client  segment. 

(def ine-nodal  Hode-Client) 

; ;  This  nodal  maintains  the  node's  reference  non-resident  handler  map. 

(def ine-nodal  Reference-Handlers) 

;;  This  nodal  maintains  the  node's  current  non-resident  handler  map. 

(def ine-nodal  Current-Handlers) 

; ;  This  nodal  maintains  the  set  of  outstanding  handler  requests 
(def ine-nodal  Requested-Handlers) 

; ;  This  is  test  handler  for  the  non-resident  handler  mechrmism.  It 
; ;  starts  by  initializing  the  HR  handler  system.  Then  it  injects  a 
; ;  test  handler  into  the  distributed  reference  handler  map  and  inserts 
; ;  then  into  the  handler  name  translation  system.  Then  it  invokes  it 
; ;  on  each  node  tvice. 

(def ine-handler  Test  ()  (Barrier  Index) 

(attribute  (self)  Barrier  B-SYHC) 

(adjust-count  (self)  Barrier  l) 

(send  (another-node-id  AMY)  10  HQRNAL  Initialize-NR-Handlers 
(node-id)  (self)  Barrier) 

(test-count  (self)  Barrier) 

(adjust-count  (self)  Barrier  1) 

(send  (another-node-id  ANY)  12  NORMAL  Add-Handler 
25  (node-id)  (self)  Baurrier  Test-NR-Handler) 

(test-count  (self)  B2urrier) 

(urite  (self)  Index  0) 

Loop-A 

(send  (read  (self)  Index)  14  NORMAL  Dispatch  25  HAVE  A  NICE  DAY) 

(urite  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-minus  (comp^Lre  (read  (self)  Index)  NUMBER-OF-NODES )  Loop-A) 
(write  (self)  Index  0) 

Loop-B 

(send  (read  (self)  Index)  14  NORMAL  Dispatch  25  HAVE  A  NICE  DAY) 

(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(brimeh-minus  (compare  (read  (self)  Index)  NUMBER-OF-NODES)  Loop-B) 
(destroy-segment  (self))) 

;;  This  handler  is  the  test  case  for  a  non-resident  handler.  It 
;;  prints  the  contents  of  the  active  segment. 

(def ine-heuidler  Test-NR-Handler  (rVARIABLE)  (Index) 

(print-user  ""knr  handler  dispatched  on  node  "d:  [  "  (node-id)) 

;;  (write  (self)  Index  0) 

; ;  Loop 

;;  (print-user  ""a  "  (read  (self)  (read  (self)  Index))) 

;;  (write  (self)  Index  (plus  (read  (self)  Index)  1)) 

;;  (branch-minus  (compare  (read  (self)  Index)  (read  (self)  LENGTH))  Loop) 
(print-user  "]**") 

(write  (self)  Index  (times  (read  (self)  LENGTH)  (read  (self)  LENGTH))) 
(destroy-segment  (self))) 
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;;  Tbis  this  is  the  main  handler  lor  non-resident  handler  support. 

;;  Uhen  a  non-resident  handler  is  invoked,  this  handler  is  called  as 
;;  the  normal  handler  type.  It  then  uses  the  lirst  argument  as  the 
; ;  non-resident  handler  type.  It  tests  il  the  handler  is  present  on 
;;  the  node.  II  not,  it  obtains  it.  Then  it  invokes  it  sitb  the 
;;  remaining  eu’giiments.  A  special  argument  segment  must  be  created. 

; ;  Since  the  locals  lor  non-resident  handler  are  included  in  this 
;;  space,  care  must  be  taken  to  only  copy  parameters  into  it.  Hote; 

; ;  since  PiSim  presently  does  not  support  direct  handler  segment 
;;  calls,  the  lirst  argument  ol  the  non-resident  handler  is  a  symbol 
; ;  ol  a  normal  handler. 

(deline-handler  Dispatch  (Kama  ‘.VARIABLE)  (Handler  Message-Length  Arguments 

Index-A  Index-B) 

(write  (sell)  Handler  (match  (read  (nodals)  Current-Handlers) 

(read  (sell)  Hame))) 

(branch-not-zero  (compare  (read  (sell)  Handler)  UHBOUND)  Handler-Known) 
(call  6  Lookup-Handler  (read  (sell)  Marne)) 

(write  (sell)  Handier  (match  (read  (nodals)  Current-Handlers) 

(read  (sell)  Name))) 

Handler-Known 

(write  (sell)  Message-Length  (minus  (read  (sell)  LENGTH)  6)) 

(write  (sell)  Arguments 

(create-read-write-segment  (minus  (read  (sell)  LENGTH)  10))) 

(write  (sell)  Index-A  9) 

(write  (sell)  Index-B  0) 

Loop 

(branch-not-zero  (probe  (sell)  (read  (sell)  Index-A)  READ)  Skip) 

(write  (read  (sell)  Arguments)  (read  (sell)  Index-B) 

(read  (sell)  (read  (sell)  Index-A))) 

Skip 

(write  (sell)  Index-A  (plus  (read  (sell)  Index-A)  1)) 

(write  (sell)  Index-B  (plus  (read  (sell)  Index-B)  1)) 

(branch-minus  (comp2U‘e  (read  (sell)  Index-A)  (read  (sell)  LENGTH))  Loop) 
(call-segment  (read  (sell)  Message-Length)  (read  (read  (sell)  Handler)  2) 
(read  (sell)  Arguments)) 

;;  (call-segment  (read  (sell)  Message-Length)  (read  (sell)  Handler) 

;;  (read  (sell)  Arguments)) 

(destroy-segment  (read  (sell)  Arguments)) 

(destroy-segment  (sell))) 

; ;  This  h2uidler  lookups  a  non-resident  handler.  It  is  called  when  a 
;;  handler  is  not  present  in  the  current  handler  map.  It  lirst  looks 
; ;  to  see  il  the  handler  has  been  requested  by  checking  the  requested 
;;  handler  set.  Il  the  handler  has  not  been  requested,  it  creates  a 
; ;  d-sync  lor  luture  requests  and  inserts  it  into  the  requested 
;;  handler  set.  It  then  looks  up  the  relerence  handler  location,  and 
;;  requests  the  handler.  Then  it  waits  on  the  d-sync  it  created.  Il 
; ;  the  handler  had  already  been  requested,  the  task  waits  on  the 
;;  d-sync  in  the  requested  handler  set.  This  d-sync  is  written  by  the 
;;  hamdler  which  replies  the  requested  handler. 

(deline-handler  Lookup-Handler  (Name)  (Handler  Location) 

(write  (sell)  Handler  (match  (read  (nodals)  Request ed-Handlers) 

(read  (sell)  Name))) 

(branch-not-zero  (compare  (read  (sell)  "rudler)  UNBOUND)  Handler-Requested) 
(write  (sell)  Handler  (create-read-write  sgment  1)) 
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(insert  (read  (nodals)  Requested-Handlers) 

(read  (sell)  Nane)  (read  (sell)  Handler)) 

: :  (print-user  ""ftletching  handler  "a  on  node  ~d~t" 

;;  (read  (sell)  Name)  (node-id)) 

(erite  (sell)  Location 

(call  6  Hatch  (read  (sell)  Name)  (read  (nodals)  Node-Client))) 
(send  (read  (sell)  Location)  7  NORMAL  Fetch-Handler 
(read  (sell)  Name)  (node-id)) 

Heoxdler-Requested 

(read  (read  (sell)  Handler)  0) 

(destroy-segment  (sell))) 

; ;  This  handler  letches  a  handler  Irom  a  relerence  node.  That  segment 
;;  is  then  sent  to  the  caller. 

(deline-handler  Fetch-Handler  (Name  Reply-Node)  (Handler  Size) 

(write  (sell)  Handler  (match  (read  (nodals)  Relerence-Handlers) 

(read  (sell)  Name))) 

(write  (sell)  Size  (plus  (read  (read  (sell)  Handler)  HANDLER-LENGTH)  6)) 
(send-segment  (read  (sell)  Reply- ^ode)  (read  (sell)  Size)  NORMAL 
Reply-Handler  (read  (sell)  Handler)) 

(destroy-segment  (sell))) 

; ;  This  handler  installs  a  handler  into  the  current  hemdler  map  on  the 
;;  local  node.  Handler  requiring  the  new  non-resident  handler  are 
; ;  reactivated  when  this  handler  writes  the  entry  in  the  requested 
;;  handler  set. 

(deline-handler  Reply-Handler  (Handler-Length  Handler-Ncune  : VARIABLE) 

(Handler  Index- A  Indez-B) 

(write  (sell)  Handler 

(create-read-write-segment  (read  (sell)  Handler-Length))) 

(write  (sell)  Index-A  Handler-Length) 

(write  (sell)  Index-B  0) 

Loop 

(write  (read  (sell)  Handler)  (read  (sell)  Index-B) 

(read  (sell)  (read  (sell)  Index-A))) 

(write  (sell)  Index-A  (plus  (read  (sell)  Index-A)  1)) 

(write  (sell)  Index-B  (plus  (read  (sell)  Index-B)  1)) 

(branch-minus  (compare  (read  (sell)  Index-A)  (read  (sell)  LENGTH))  Loop) 
(insert  (read  (nodeds)  Current-Handlers) 

(read  (sell)  Handler-Name)  (read  (sell)  Handler)) 

(write  (sell)  Index-A  (match  (read  (nodals)  Requested-Handlers) 

(read  (sell)  Handler-Name))) 

(write  (read  (sell)  Index-A)  0  HANDLER-PRESENT) 

(remove  (read  (nodals)  Requested-Handlers)  (read  (sell)  Hcindler-Ncune) ) 
(destroy-segment  (sell))) 

; ;  This  handler  adds  a  segment  into  the  relerence  h2mdler  map  on  this 
;;  node.  It  also  adds  the  haindler  name  to  the  tremslation  system  via 
;;  the  node  client.  The  length  ol  the  handler  segment  is  computed 
; ;  by  subtracting  the  number  ol  locals  and  other  parameters  (plus  3) 

; ;  Irom  the  message  length  lield.  Two  is  added  to  the  handler  segment 
;;  lor  the  length  and  name  slots. 

(deline-handler  Add-Handler  (Name  Reply-Node  Reply-Segment  Reply-Ollset 

rVARIABLE)  (Size  Index-A  Index-B  B2mdler) 
(call  7  Insert  (read  (sell)  Name)  (node-id)  (read  (nodals)  Node-Client)) 
(writs  (sell)  Size  (minus  (read  (sell)  LENGTH)  9)) 

(write  (sell)  Handler  (create-read-write-segment  (read  (sell)  Size))) 
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(srite  (read  (self)  Handler)  HAHOLER-LEKGTH  (read  (self)  Size)) 

(vrite  (read  (sell)  Hemdler)  HAHDLER-MAME  (read  (sell)  Name)) 

(vrite  (sell)  Indez-A  :VARIABLE) 

(srite  (sell)  Indez-B  2) 

Loop 

(erite  (read  (sell)  Handler)  (read  (sell)  Indez-B) 

(read  (sell)  (read  (sell)  Indez-A))) 

(vrite  (sell)  Indez-A  (plus  (read  (sell)  Indez-A)  1)) 

(vrite  (sell)  Indez-B  (plus  (read  (sell)  Indez-B)  1)) 

(branch-plus  (compare  (read  (sell)  Indez-A)  (read  (sell)  LEHGTH))  Loop) 
(insert  (read  (nodals)  Relerence-Handlers) 

(read  (sell)  Hame)  (read  (sell)  Handler)) 

(send  (read  (sell)  Reply-Hode)  6  HORNAL  Adjust-Barrier 

-1  (read  (sell)  Reply-Segment)  (read  (sell)  Reply-Ollsst)) 
(destroy-segment  (sell))) 

; ;  This  handler  initializes  the  non-resident  handler  mechanism.  It 
; ;  initializes  the  translation  system  and  creates  an  client  on  every 
;;  node.  It  also  creates  reference  and  current  non-resident  handler 
; ;  maps  on  each  node  plus  the  requested  handler  set.  The  reference 
; ;  map  maintains  the  sole  reference  copy  ol  handlers  stored  on  that 
; ;  node.  The  current  handler  map  maintains  handlers  that  have  been 
; ;  temporarily  cached  lor  use  on  that  node.  The  requested  handler  set 
;;  includes  all  handlers  that  have  been  requested  on  the  node, 
(deline-handler  Initialize-NR-Handlers  (Reply-Hode  Reply-Segment 

Reply-Offset)  (Agent-Hode 
Agent-Segment  Indez  Barrier) 

(send  (another-node-id  AHY)  9  NORMAL  Initiadize-Translations 
(node-id)  (sell)  Agent-Hode  Agent-Segment) 

(attribute  (sell)  Barrier  B-SYkC) 

(adjust-count  (sell)  Barrier  HUMBER-OF-HODES) 

(vrite  (sell)  Indez  HUMBER-OF-NODES) 

Loop 

(vrite  (sell)  Indez  (minus  (read  (sell)  Indez)  1)) 

(send  (read  (sell)  Indez)  9  NORMAL  Create-Node-Maps 

(read  (sell)  Agent-Node)  (read  (sell)  Agent-Segment) 

(node-id)  (sell)  Barrier) 

(branch-not-zero  (read  (sell)  Indez)  Loop) 

(test-count  (sell)  Barrier) 

(send  (read  (sell)  Reply-Node)  6  NORMAL  Adjust-Barrier 

-1  (read  (sell)  Reply-Segment)  (read  (sell)  Reply-Ollset)) 
(destroy-segment  (sell))) 

; ;  This  handler  creates  a  translation  client  on  this  node  and  stored  a 
; ;  pointer  to  it  in  a  nodal.  Then  it  creates  the  reference  emd 
; ;  current  handler  maps  and  the  requested  handler  set  and  stores  them 
; ;  in  the  corresponding  nodals . 

(del ine-handler  Create-Node-Maps  (Agent-Node  Agent-Segment  Reply-Node 

Reply-Segment  Reply-Ollset) 

(Client -Segment) 

(send  (node-id)  9  NORMAL  Create-Client 

(read  (sell)  Agent-Node)  (read  (sell)  Agent-Segment) 

(node-id)  (sell)  Client -Segment) 

(vrite  (nodals)  Node-Client  (read  (sell)  Client-Segment)) 

(vrite  (nodals)  Relerence-Handlers 

(create-associative-segment  BANDLER-MAP-SIZE  UNBOUND-SAFE)) 
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(srite  (nodals)  Current-Handlers 

(create-associative-segment  HAHDLER-NAP-SIZE  UHBOUID-SAFE) ) 
(write  (nodals)  Requested-Handlers 

(create-associative-segment  HAHDLER-NAP-SIZE  UNBOUND-SAFE)) 
(send  (read  (self)  Reply-Node)  6  NORMAL  Adjust-Barrier 

-1  (read  (self)  Reply-Segment)  (read  (self)  Reply-Offset)) 
(destroy-segment  (self))) 


A. 9  N-Body  Simulation 


H  -  B  o  d  y 


; ;  Scott  Wills  27  October  1989 

;;  This  is  an  H-Body  simulator  uriten  in  Pi. 

(in-package  ’user) 

; ;  The  mass  of  each  body  if  a  function  of  its  ID: 

I  » 

((body  ID  *  Delta-Mass)  +  Base-Mass) 

(def ine-constant  Base-Mass  1000.0) 

(define-constant  Delta-Mass  100.0) 

; ;  Bodies  are  randomly  position  within  the  box  described  by  points 
;;  (0,0)  and  (Maximum-Position,  Maximum-Position). 

(define-constant  Maximum-Position  1000.0) 

;;  These  offset  define  the  location  of  the  slots  in  Body  objects, 
(define-constant  I-Nodes-Offset  3) 

(define-constant  I-Segments-Offset  4) 

(define-constant  AX-Offset  9) 

(define-constant  AY-Offset  10) 

(define-constant  Barrier-Offset  11) 

;;  These  offset  define  the  location  of  the  slots  in  Interaction  objects, 
(define-constant  ID-A-Offset  16) 

(define-constant  X-A-Offset  3) 

(def ine-const^alt  Y-A-Offset  6) 

(def ine-const2ait  X-B-Offset  4) 

(define-constant  Y-B-Offset  6) 

;;  This  is  the  gravitational  constant. 

(define-constant  G  1.0) 

; ;  This  handler  creates  the  necessary  bodies  and  links  for  the  N  body 
; ;  simulation.  First  the  bodies  are  created  (with  the  nodes  and 
;;  segments  being  stored  in  2irraies).  Then  each  interaction  between 
; ;  the  bodies  is  generated.  The  simulation  is  self-starting. 

(def ine-handler  Initialize-System  (Number-Bodies  Last-Tick)  (Body-Nodes 

.Body-Segments  N-1  Index-A  Index-B  Node) 
(branch-not-minus  (minus  (read  (self)  Htunber-Bodies )  2)  Skip) 

(write  (self)  Number-Bodies  2) 

Skip 

;;  (print-user  "■ft(Data-Size  'd  *d)"t"  (read  (self)  Number-Bodies) 
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::  (read  (sell)  Last-Tick)) 

(write  (sell)  Body-Nodes 

(create-read -write-segment  (read  (sell)  Number-Bodies))) 

(write  (sell)  Body-Segments 

(create-read-write-segment  (read  (sell)  Number-Bodies))) 

(write  (sell)  N-1  (minus  (read  (sell)  Number-Bodies)  1)) 

(write  (sell)  Indez-A  0) 

Loop-A 

(write  (sell)  Node  (another-node-id  ANY)) 

(write  (read  (sell)  Body-Nodes)  (read  (sell)  Indez-A)  (read  (sell)  Node)) 
(send  (read  (sell)  Node)  19  NORMAL  Start-Body  (read  (sell)  Indez-A) 

(read  (sell)  N-1)  (read  (sell)  Last-Tick) 

(node-id)  (read  (sell)  Body-Segments)) 

(write  (sell)  Indez-A  (plus  (read  (sell)  Indez-A)  1)) 

(branch-not-zero  (compare  (read  (sell)  Number-Bodies) 

(read  (sell)  Indez-A))  Loop-A) 

(write  (sell)  Indez-A  0) 

(write  (sell)  Indez-B  1) 

Loop-B 

(send  (another-node-id  ANY)  23  NORMAL  Start-Interaction 
(read  (sell)  Indez-A)  (read  (sell)  Indez-B) 

(read  (read  (sell)  Body-Nodes)  (read  (sell)  Indez-A)) 

(read  (read  (sell)  Body-Nodes)  (read  (sell)  Indez-B)) 

(read  (read  (sell)  Body-Segments)  (read  (sell)  Indez-A)) 

(read  (read  (sell)  Body-Segments)  (read  (sell)  Indez-B)) 

(read  (sell)  Last-Tick)) 

(write  (sell)  Indez-B  (plus  (read  (sell)  Indez-B)  1)) 

(branch-not-zero  (compare  (read  (sell)  Number-Bodies) 

(read  (sell)  Indez-B))  Loop-B) 

(write  (sell)  Indez-A  (plus  (read  (sell)  Indez-A)  1)) 

(write  (sell)  Indez-B  (plus  (read  (sell)  Indez-A)  1)) 

(branch-not-zero  (compare  (read  (sell)  N-1)  (read  (sell)  Indez-A))  Loop-B) 
(destroy-segment  (read  (sell)  Body-Nodes)) 

(destroy-segment  (read  (sell)  Body-Segments)) 

(destroy-segment  (sell))) 

;;  This  handler  initializes  and  starts  a  body.  The  bodies  storage 
; ;  is  created  when  this  handler  is  invoked,  so  all  the  parameters  are 
;;  initialized  automatically.  The  interaction  node  and  segment  arrays 
;;  are  created  lirst.  Alter  that,  the  body  segment  pointer  can  be 
; ;  replied  to  the  system  initialization  prograun.  The  X  and  Y  position  is 
;;  randomly  selected  (within  specilied  limits).  The  body  velocity  aind 
;;  acceleration  aure  initially  zero.  The  tick  count  is  set  to  zero.  At 
;;  this  point,  the  handler  enters  the  update/adjust  loop.  First  it 
;;  sends  it  location  to  all  interactions,  then  it  wait  lor  the 
;;  acceleration  updates  to  be  received  (via  a  B-SYNC).  Then  it  adjusts 
;;  its  velocity  amd  position,  increments  the  tick,  and  repeats  (unless 
;;  the  last  tick  has  been  reached). 

(deline-handler  Start-Body  (ID  Count  Last-Tick  Reply-Node  Reply-Segment) 

(I-Nodes  I-Segments  X  Y  VX  VY  AX  AY 
Bairrier  Tick  Indez) 

(write  (sell)  I-Nodes  (create-read-write-segment  (read  (sell)  Count))) 
(write  (sell)  I-Segments  (create-read-write-segment  (read  (sell)  Count))) 
(send  (read  (sell)  Reply-Node)  6  NORMAL  Reply-Vadue  (sell) 

(read  (sell)  Reply-Segment)  (read  (sell)  ID)) 
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(srite  (sell)  X  (random  Maximum-Position)) 

(srite  (self)  Y  (random  Maximum-Position)) 

(srite  (sell)  VX  0) 

(write  (sell)  VY  0) 

(attribute  (sell)  Barrier  B-SYIC) 

(write  (sell)  Tick  0) 

;;  (print-user  •■■k(Body-Position  "d  "d  "6,11  "6,11)"*"  (read  (sell)  Tick) 

;;  (read  (sell)  ID)  (read  (sell)  X)  (read  (sell)  Y)) 

Loop-A 

(write  (sell)  AX  0) 

(write  (sell)  AY  0) 

(adjust-count  (sell)  Barrier  (read  (sell)  Count)) 

(write  (sell)  Index  0) 

Loop-B 

(send  (read  (read  (sell)  I-Nodes)  (read  (sell)  Index)) 

7  NORMAL  Update-Position 

(read  (sell)  X)  (read  (sell)  Y)  (read  (sell)  ID) 

(read  (read  (sell)  I-Segments)  (read  (sell)  Index))) 

(write  (sell)  Index  (plus  (read  (sell)  Index)  1)) 

(branch-not-zero  (compare  (read  (sell)  Index)  (read  (sell)  Count))  Loop-B) 
(test-count  (sell)  Barrier) 

(write  (sell)  X  (plus  (read  (sell)  X)  (read  (sell)  VX))) 

(write  (sell)  Y  (plus  (read  (sell)  Y)  (read  (sell)  VY))) 

(write  (sell)  VX  (plus  (read  (sell)  VX)  (read  (sell)  AX))) 

(write  (sell)  VY  (plus  (read  (sell)  VY)  (read  (sell)  AY))) 

(write  (sell)  Tick  (plus  (read  (sell)  Tick)  1)) 

::  (print-user  "■*;(Body-Position  *d  'd  '5,11  "S.lD'k"  (read  (sell)  Tick) 

;;  (read  (sell)  ID)  (read  (sell)  X)  (read  (sell)  Y)) 

(branch-not-zero  (read  (sell)  ID)  Skip) 

(branch-not-zero  (mod  (read  (sell)  Tick)  6)  Skip) 

(print-user  ■'*4d  ”  (read  (sell)  Tick)) 

(branch-not-zero  (mod  (read  (sell)  Tick)  60)  Skip) 

(print-user  "*ft") 

Skip 

(branch-not-zero  (compare  (read  (sell)  Tick)  (read  (sell)  Last-Tick)) 
Loop-A) 

(destroy-segment  (read  (sell)  I-Nodes)) 

(destroy-segment  (read  (sell)  I-Segments)) 

(destroy-segment  (sell))) 

;  ;  This  handler  creates  and  starts  an  interaction.  It  begins  by 
;;  attributing  the  position  slots  ol  A  and  B  bodies.  It  then  calculates 
;;  the  mass  ol  the  bodies.  The  handler  then  sends  a  message  to  the  two 
; ;  bodies  setting  this  interaction’s  location.  For  the  B  body,  the  A 
;;  body’s  ID  is  used  as  the  interaction  identification.  For  the  A  body, 

; ;  one  minus  B’s  ID  is  used  (since  no  body  has  an  interaction  with 
;;  itself).  This  also  starts  the  simulation  on  the  body  objects,  and 
;;  initializes  the  tick.  At  this  point,  the  interaction  begins  the 
;;  simulation  loop  by  calculating  the  delta  distances.  This  will  lock 
;;  on  the  S-SYNCs  until  the  values  are  updated  by  bodies.  The  force 
; ;  between  the  two  bodies  is  then  calculated,  and  the  acceleration 
; :  component  on  each  body  is  determined.  These  components  are  sent  to 
; ;  the  bodies  and  the  process  begins  again  (until  the  last  tick  is 
;;  reached).  The  following  formulas  is  used. 
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::  AI-A  =  K  ♦  M-B  *  (X-A  -  X-B)  /  ((X-B  -  X-A)‘2  +  (Y-B  -  Y-A)-2)-1.5 

::  AY-A  =  K  *  M-B  *  (Y-A  -  Y-B)  /  ((X-B  -  X-A)‘2  +  (Y-B  -  Y-A)‘2)-1.5 

AX-B  =  K  ♦  M-A  *  (X-B  -  X-A)  /  ((X-B  -  X-A)*2  +  (Y-B  -  Y-A)-2)-1.5 

::  AY-B  =  K  *  M-A  ♦  (Y-B  -  Y-A)  /  ((X-B  -  X-A)-2  +  (Y-B  -  Y-A)-2)-1.5 

(deline-handler  Start-Interaction  (ID-A  ID-B  Hode-A  lode-B  Segment-A 

Segment-B  Last-Tick)  (X-A  X-B  Y-A  Y-B 
Mass-A  Mass-B  Tick  DX  DY  FX  FY  AX  AY) 

(attribute  (sell)  X-A  S-SYHC) 

(attribute  (sell)  Y-A  S-SYBC) 

(attribute  (sell)  X-B  S-SYBC) 

(attribute  (sell)  Y-B  S-SYHC) 

(srite  (sell)  Mass-A  (times  (read  (sell)  ID-A)  Delta-Mass)) 

(vrite  (sell)  Nass-A  (plus  (read  (sell)  Hass-A)  Base-Mass)) 

(vrite  (sell)  Mass-B  (times  (read  (sell)  ID-B)  Delta-Mass)) 

(urite  (sell)  Mass-B  (plus  (read  (sell)  Mass-B)  Base-Mass)) 

(write  (sell)  ID-B  (minus  (read  (sell)  ID-B)  1)) 

(send  (read  (sell)  Node-A)  9  NORMAL  Vrite-Interaction-Location 

(read  (sell)  ID-B)  (node-id)  (sell)  (read  (sell)  Segment-A)) 

(send  (read  (sell)  Hode-B)  9  HORMAL  Write-Interaction-Location 

(read  (sell)  ID-A)  (node-id)  (sell)  (read  (sell)  Segment-B)) 

(write  (sell)  Tick  0) 

Loop-A 

(write  (sell)  DX  (minus  (read  (sell)  X-B)  (read  (sell)  X-A))) 

(write  (sell)  DY  (minus  (read  (sell)  Y-B)  (read  (sell)  Y-A))) 

(write  (sell)  FX  (times  (read  (sell)  DX)  (read  (sell)  DX))) 

(write  (sell)  FY  (times  (read  (sell)  DY)  (read  (sell)  DY))) 

(write  (sell)  FX  (plus  (read  (sell)  FX)  (read  (sell)  FY))) 

(write  (sell)  FX  (exponent  (read  (sell)  FX)  -1.5)) 

(write  (sell)  FX  (times  (read  (sell)  FX)  G)) 

(write  (sell)  FY  (times  (read  (sell)  FX)  (read  (sell)  DY))) 

(write  (sell)  FX  (times  (read  (sell)  FX)  (read  (sell)  DX))) 

(write  (sell)  AX  (times  (read  (sell)  FX)  (read  (sell)  Mass-B))) 

(write  (sell)  AY  (times  (read  (sell)  FY)  (read  (sell)  Mass-B))) 

(send  (read  (sell)  Node-A)  6  NORMAL  Add- Acceleration 

(read  (sell)  AX)  (read  (sell)  AY)  (read  (sell)  Segment-A)) 

(write  (sell)  AX  (times  (read  (sell)  FX)  (read  (sell)  Mass-A))) 

(write  (sell)  AY  (times  (read  (sell)  FY)  (read  (sell)  Mass-A))) 

(write  (sell)  AX  (times  (read  (sell)  AX)  -1)) 

(write  (sell)  AY  (times  (read  (sell)  AY)  -1)) 

(send  (read  (sell)  Ncde-B)  6  NORMAL  Add-Acceleration 

(road  (sell)  AX)  (road  (sell)  AY)  (read  (sell)  Segment-B)) 

(write  (sell)  Tick  (plus  (read  (sell)  Tick)  1)) 

(breuich-not-zero  (compare  (read  (sell)  Tick)  (read  (sell)  Last-Tick)) 
Loop-A) 

(destroy-segment  (sell))) 

; ;  This  handler  writes  the  interaction  location  in  the  correct  position 
;;  in  the  body's  I-Node  and  I-Segment  arrays. 

(deline-handler  Write-Interaction-Location  (I  I-Node  I-Segment  Body-Segment) 

(I-Nodes  I-Segments) 

(write  (sell)  I-Nodes  (read  (read  (sell)  Body-Segment)  I-Nodes-Ollset)) 
(write  (sell)  I-Segments 

(read  (read  (sell)  Body-Segment)  I-Segments-Ollset) ) 

(write  (read  (sell)  I-Hodes)  (read  (sell)  I)  (read  (sell)  I-Node)) 

(write  (read  (sell)  I-Segmonts)  (read  (sell)  I)  (read  (sell)  I-Segment)) 
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(destroy-segment  (sell))) 

;;  This  handler  updates  the  bodies  position  on  an  interaction.  The  body 
;;  ID  is  used  to  determine  whether  the  update  is  body  A  or  B. 

(deline-handler  Update-Position  (X  Y  ID  I-Segnent)  () 

(branch-not-zero  (compare  (read  (read  (sell)  I-Segment)  ID-A-Ollset) 

(read  (sell)  ID))  Update-Body-B) 

(write  (read  (sell)  I-Segment)  X-A-Ollset  (read  (sell)  X)) 

(write  (read  (sell)  I-Segment)  Y-A-Ollset  (read  (sell)  Y)) 

(destroy-segment  (sell)) 

Update-Body-B 

(write  (read  (sell)  I-Segment)  X-B-Ollset  (read  (sell)  X)) 

(write  (read  (sell)  I-Segment)  Y-B-Ollset  (read  (sell)  Y)) 

(destroy-segment  (sell))) 

; :  This  handler  adds  the  acceleration  component  ol  one  interaction  to 
;;  the  bodies  composite  acceleration  values.  The  bodies  barrier  is 
; :  also  decremented. 

(deline-handler  Add-Acceleration  (AX  AY  Body-Segment)  () 

(write  (read  (sell)  Body-Segment)  AX-Ollset 

(plus  (read  (read  (sell)  Body-Segment)  AX-Ollset)  (read  (sell)  AX))) 
(write  (read  (sell)  Body-Segment)  AY-Ollset 

(plus  (read  (read  (sell)  Body-Segment)  AY-Ollset)  (read  (sell)  AY))) 
(adjust-count  (read  (sell)  Body-Segment)  Barrier-Ollset  -1) 

(destroy-segment  (sell))) 


A.  10  Relaxation 


;;  Relaxation 

»  $ 

; ;  Scott  Wills  23  February  1990 

»  * 

;;  This  is  a  2d  rel2ucation  simulator  writen  in  Pi. 

(in-package  ’user) 

; ;  This  constant  delines  the  array  dimensions 
(deline-constant  X-Size  10) 

(deline-constant  Y-Size  10) 

;;  This  constant  delines  the  number  ol  nodes  in  the  system, 
(deline-consteoit  Humber-Ol-Iodes  64) 

; ;  This  constant  delines  the  initial  size  ol  the  element  map 
(deline-constant  ELEMEKT-MAP-SIZE  2) 

;;  These  constants  deline  neighbor  temperature  ollsets  in  an  element, 
(deline-constant  NORTH-OFFSET  7) 

(deline-constzmt  EAST-OFFSET  8) 

(deline-constant  WEST-OFFSET  9) 

(deline-constant  SOOTH-OFFSET  10) 

;;  These  constants  deline  the  boundary  temperatures. 

(deline-constant  MORTH-BORDER-TEMP  100.0) 

(deline-constant  EAST-BORDER-TEKP  1000.0) 

(deline-constant  WEST-BORDER-TEMP  1000.0) 
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(deline-constauit  SOUTH-BORDER-TENP  100.0) 

; ;  This  nodal  contains  the  element  map;  eoi  associative  set  oi  elements 
; ;  accessed  by  their  index. 

(def ine-nodal  Element-Map) 

; ;  This  handler  creates  the  relaxation  elements  in  the  system.  Each 
;;  element  is  assigned  a  index  equatl  its  position  in  the  array.  A 
; ;  elements  index  =  (Y*Y-Size  +  X-Size) .  A  element  is  created  on  node 
;;  (mod  index  Number-Of-Vodes).  Each  element  performs  a  fixed  niusber  of 
;;  iterations  (defined  by  Last-Tick),  then  returns  its  final  temperature  to 
;;  an  array  created  by  this  task.  This  task  then  prints  out  the  final 
; ;  results . 

(define-handler  Relax  (Last-Tick)  ( lumber-Elements  Index  Mode  Results) 
(write  (self)  Humber-Elements  (times  X-Size  Y-Size)) 

(write  (self)  Results 

(create-read-write-segment  (read  (self)  Humber-Elements))) 

(write  (self)  Index  0) 

Start-Loop 

(write  (self)  Mode  'mod  (read  (self)  Index)  Humber-Of-Hodes ) ) 

(send  (read  (self)  Node)  21  NORMAL  Start-Element  (read  (self)  Index) 

(read  (self)  Last-Tick)  (node-id)  (read  (self)  Results)) 

(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-minus  (compare  (read  (self)  Index)  (read  (self)  Number-Elements)) 
St art -Loop) 

(write  (self)  Index  (times  X-Size  Y-Size)) 

(write  (self)  Index  (minus  (read  (self)  Index)  X-Size)) 

(print -user  ""A") 

Print-Loop 

(print-user  "'5, If  •'  (read  (read  (self)  Results)  (read  (self)  Index))) 
(write  (self)  Index  (plus  (read  (self)  Index)  1)) 

(branch-not-zero  (mod  (read  (self)  Index)  X-Size)  Print-Loop) 

(print-user  "*#”) 

(write  (self)  Index  (minus  (read  (self)  Index)  (times  2  X-Size))) 
(branch-not-minus  (read  (self)  Index)  Print-Loop) 

(destroy-segment  (read  (self)  Results)) 

(destroy-segment  (self))) 

; ;  This  handler  starts  by  initializing  the  element  object  and 
; ;  installing  it  in  the  element  map.  It  also  determines  whether  it  is 
;;  a  boundary  element  and  makes  the  appropriate  adjustment.  S-Syncs 
; ;  are  the  basis  of  this  example.  Four  s-syncs  are  created  for  the 
;;  north,  south,  east,  and  west  neighbors.  The  element  task  then 
;;  enters  a  loop  of  sending  its  value  to  its  neighbors,  then  averaging 
;;  the  its  neighbor’s  values  to  compute  it’s  new  V2d.ue. 

(define-handler  Start-Element  (Index  Last-Tick  Reply-Node  Reply-Segment) 

(N-Index  E-Index  U-Index  S-Index 
N-Temp  E-Temp  W-Temp  S-Temp 
N-Hode  E-Node  W-Node  S-Node 
Temp  Tick) 

(write  (self)  N-Index  (plus  (read  (self)  Index)  X-Size)) 

(write  (self)  E-Index  (plus  (read  (self)  Index)  1)) 

(write  (self)  V-Index  (minus  (read  (self)  Index)  1)) 

(write  (self)  S-Index  (minus  (read  (self)  Index)  X-Size)) 

(attribute  (self)  N-Temp  S-SYNC) 

(attribute  (self)  E-Temp  S-SYNC) 

(attribute  (self)  W-Temp  S-SYNC) 
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(attribute  (self)  S-Temp  S-SYHC) 

(write  (self)  I-Hode  (mod  (read  (self)  H-Index)  HUMBER-OF-IODES) ) 

(write  (self)  E-Iode  (mod  (read  (self)  E-Indez)  NUNBER-OF-NODES) ) 

(write  (self)  W-Hode  (mod  (read  (self)  W-Index)  HUMBER-OF-HODES) ) 

(write  (self)  S-Node  (mod  (read  (self)  S-Index)  HUMBER-OF-MODES) ) 

(write  (self)  Temp  0.0) 

(write  (self)  Tick  0) 

(branch-minus  (compare  (read  (self)  Index) 

(times  X-Size  (minus  Y-Size  1)))  Skip-A) 
(write  (self)  I-Index  BORDER) 

(attribute  (self)  B-Ternp  D-SYHC) 

(write  (self)  I-Temp  HORTH-BORDER-TEMP) 

Skip-A 

(branch-not-zero  (compare  (mod  (read  (self)  Index)  X-Size) 

(minus  X-Size  1))  Skip-B) 

(write  (self)  E-Index  BORDER) 

(attribute  (self)  E-Temp  D-SYHC) 

(write  (self)  E-Temp  EAST-BORDER-TEMP) 

Skip-B 

(branch-not-minus  (compare  (read  (self)  Index)  X-Size)  Skip-C) 

(write  (self)  S-Index  BORDER) 

(attribute  (self)  F-Temp  D-SYHC) 

(write  (self)  S-Temp  SOUTH-BORDER-TEMP) 

Skip-C 

(branch-not-zero  (mod  (read  (self)  Index)  X-Size)  Skip-D) 

(write  (self)  U-Index  BORDER) 

(attribute  (self)  W-Temp  D-SYHC) 

(write  (self)  W-Temp  WEST-BORDER-TEMP) 

Skip-D 

(branch-zero  (probe  (nodals)  Element-Map  READ)  Map-Exists) 

(write  (nodals)  Element-Map 

(create-associative-segment  ELEMEHT-MAP-SIZE  UHBOUHD-SAFE)) 
Map-Exists 

(insert  (read  (nodeds)  Element-Map)  (read  (self)  Index)  (self)) 
Relax-Loop 

(branch-zero  (compare  (read  (self)  H-Index)  BORDER)  Skip-E) 

(send  (read  (self)  H-Hode)  7  NORMAL  Update-Temp 

(read  (self)  K-Index)  SOUTH-OFFSET  (read  (self)  Temp)) 

Skip-E 

(branch-zero  (compare  (read  (self)  E-Index)  BORDER)  Skip-F) 

(send  (read  (self)  E-Hode)  7  NORMAL  Update-Temp 

(read  (self)  E-Index)  WEST-OFFSET  (read  (self)  Temp)) 

Skip-F 

(breoich-zero  (compeo'e  (read  (self)  W-Index)  BORDER)  Skip-G) 

(send  (read  (self)  W-Node)  7  NORMAL  Update-Temp 

(read  (self)  W-Index)  EAST-OFFSET  (read  (self)  Temp)) 

Skip-G 

(branch-zero  (compare  (read  (self)  S-Index)  BORDER)  Skip-H) 

(send  (read  (self)  S-Node)  7  NORMAL  Update-Temp 

(read  (self)  S-Index)  NORTH-OFFSET  (read  (self)  Temp)) 

Skip-H 

(write  (self)  Temp  (plus  (read  (self)  H-Temp)  (read  (self)  E-Temp))) 

(write  (self)  Temp  (plus  (read  (self)  Temp)  (read  (self)  W-Temp))) 

(write  (self)  Temp  (plus  (read  (self)  Temp)  (read  (self)  S-Temp))) 
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(vrite  (sell)  Temp  (divide  (read  (sell)  Temp)  4)) 

(srite  (sell)  Tick  (plus  (read  (sell)  Tick)  1)) 

(branch-minus  (compare  (read  (sell)  Tick)  (read  (sell)  Last-Tick)) 
Relax-Loop) 

(send  (read  (sell)  Reply-Hode)  6  HORMAL  Reply-Value 

(read  (sell)  Temp)  (read  (sell)  Reply-Segment)  (read  (sell)  Index)) 
(destroy-segment  (sell))) 

; ;  This  handler  deposites  a  temperature  in  the  appropriate  slot  ol  an 
; ;  element . 

(deline-handler  Update-Temp  (Index  Ollset  Temp)  (Element) 

Wait-Loop 

(vrite  (sell)  Element  (match  (read  (nodals)  Element-Map) 

(read  (sell)  Index))) 

(branch-not-zero  (compare  (read  (sell)  Element)  UNBOURD)  Write-Temp) 
(suspend) 

(branch-zero  0  Wait-Loop) 

Write-Temp 

(write  (read  (sell)  Element)  (read  (sell)  Ollset)  (read  (sell)  Temp)) 
(destroy-segment  (sell))) 
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Appendix  B 

Example  Execution  Logs 


This  appendix  contains  the  execution  logs  for  the  examples  described  in 


B.l  Shared  Memory  With  Caches 


12  June  1990  0:26:30pD  sm-test  PiSim  version  0.1 


task  type  profile 

ACK . 

ACK-FLUSH . 

ADJUST-BARRIER . 

FLUSH . 

GET-CACHE-LINE . 

INVALIDATE . 

NODE-SETUP . 

PRINT-COUNT . 

READ . 

READ-ADDRESS . 

REMOTE-TEST . 

REMOTE-WRITE . 

REMOVE . 

REPLY . 

SHARED-MEMORY-SETUP 

TEST . 

TEST-1 . 

TEST-2 . 

TEST-3 . 

TEST-4 . 

TEST-5 . 

TEST-6 . 

TEST-7 . 

TEST-LOCATION . 

WRITE . 


.2,181  (  6.6*/,) 
..778  (  2.47.) 

..593  (  1.87.) 

..778  (  2.47.) 

1,560  (  4.77.) 

..780  (  2.47.) 

...64  (  0.27.) 

8,192  (  24.87.) 
..782  (  2.47.) 

4,368  (  13.27.) 
..268  (  0.87.) 
..261  (  0.87.) 
1,323  (  4.07.) 
1,560  (  4.77.) 

- 1  (  0.07.) 

- 1  (  0.07.) 

- 1  (  0.07.) 

....1  (  0.07.) 

- 1  (  0.07.) 

....1  (  0.07.) 
—  1  (  0.07.) 

- 1  (  0.07.) 

....1  (  0.07.) 
4,367  (  13.27.) 
..778  (  2.47.) 


chapter  3. 


173 


WRITE-ADDRESS . 4,362  (  13.2*/,) 

total . 33,004  (100.0*/,) 


task  status  profile 

CALL . 22,851  (  35.9*/,) 

HEW . 10,153  (  16.0*/,) 

WAITIHG . 30,604  (  48.1*/,) 

total . 63,608  (100.0*/,) 


task  run  time  profile 

9-16 . 8,785  (  26.6*/,) 

17-32 . 7,237  (  21.9*/,) 

33-64 . 11,391  (  34.6*/.) 

65-128 . 5,505  (  16.7*/,) 

129-256 . 19  (  0.1*/,) 

513-1,024 . 1  (  0.0*/,) 

1.025-2,048 . 64  (  0.2*/,) 

8,193-16,384 . 1  (  0.0*/,) 

262,145-524,288 . 1  (  0.0*/,) 

total . 33,004  (100.0*/.) 


task  uait  time  profile 

0 . 20,826  (  63.1*/,) 

1  . 9  (  0.0*/.) 

2  . 3  (  0.0*/,) 

3-4 . 9  (  0.0*/,) 

5-8 . 34  (  0.1*/,) 

9-16 . 39  (  0.1*/,) 

17-32 . 40  (  0.1*/,) 

33-64 . 4,502  (  13.6*/,) 

65-128 . 917  (  2.8*/,) 

129-256 . 1,305  (  4.0*/,) 

257-512 . 2,762  (  8.4*/,) 

513-1,024 . 403  (  1.2*/,) 

1,025-2,048 . 336  (  1.0*/,) 

2,049-4,096 . 568  (  1.7*/,) 

4,097-8,192 . 726  (  2.2*/,) 

8,193-16,384 . 521  (  1.6*/,) 

16,385-32,768 . 2  (  0.0*/,) 

32,769-65,536 . 1  (  0.0*/.) 

524,289-1,048,576 . 1  (  0.0*/,) 

total . 33,004  (100.0*/,) 


instruction  type  profile 

A-SHIFT . 11,684  (  3.5*/,) 

ADJUST-COUIT . 12,522  (  3.8*/,) 

AID . 29,144  (  8.8*^) 

AHOTHER-IODE-ID . 3  (  0.0*/,) 

ATTRIBUTE . 9,405  (  2.9*/,) 

BRAHCH-IOT-ZERO . 41,944  (  12.77.) 

BRAICH-ZERO . 19,036  (  5.8*/,) 

CALL . 45,702  (  13.9*/,) 

CREATE- ASSOCI ATI VE-SEGMEIT . 64  (  0.0*/.) 

CREATE-READ-WRITE-SEGMEIT . 702  (  0.2*/,) 
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DESTROY-SEGMEMT 

IHITIALIZE . 

USER! . 

MATCH . 

MIIUS . 

MOVE . 

PLUS . 

PRIIT-USER . 

REMOVE . 

RETURI . 

SEID . 

TEST-COUMT . 

TIMES . 

total . 


.33,004  ( 

o 

d 

.13,901  ( 

4.27.) 

..4,806  ( 

1.57.) 

.13,408  ( 

4.17.) 

..6,721  ( 

2.07.) 

.40,450  ( 

12.37.) 

..9.223  ( 

2.87.) 

....104  ( 

0.07.) 

..1,660  ( 

0.67.) 

..6,928  ( 

1.87.) 

.10,152  ( 

3.17.) 

.11,704  ( 

3.67.) 

..8,192  { 

2.57.) 

329,359  (100.07.) 

instructions 

per 

messages 

32.4 

instructions 

per 

task 

10.0 

instructions 

run 

length 

8.1 

operation  type  profile 

A-SHIFT . 

ADJUST-COUNT . 

AND . 

AHOTHER-HODE-ID . 

ATTRIBUTE . 

BRANCH-NOT-ZERO . 

BRANCH-ZERO . 

CALL . 

COMPARE . 

CREATE-ASSOCIATIVE-SEGMENT 
CREATE-READ-WRITE-SEGMENT . 

DESTROY-SEGMENT . 

INSERT . 

MATCH . 

MINUS . 

MOD . 

NODALS . 

NODE-ID . 

PLUS . 

PRINT-USER . 

READ . 

REMOVE . 

RETURN . 

SELF . 

SEID . 

TEST-COUNT . 

TIMES . 

WRITE . 

total . 


...11,684  (  0.8*/,) 

...12,522  (  0.8*/,) 

...29,144  (  1.9*/,) 

. 526  (  0.0*/,) 

_ 9,405  (  0.6*/,) 

...41,944  (  2. 7*/.) 

...19,036  (  1.27.) 

...46,702  (  3.07.) 

...44,020  (  2.97.) 

. 64  (  0.07.) 

. 702  (  0.07.) 

...33,004  (  2.27,) 

- 4,806  (  0.37.) 

...13,408  (  0.97.) 

- 6,721  (  0.47.) 

...16,384  (  1.17.) 

...26,256  (  1.77.) 

- 2,159  (  0.17.) 

- 9,223  (  0.67.) 

. 104  (  0.07.) 

..476,974  (  31.37,) 
....1,560  (  0.17.) 

....5,928  (  0.47.) 

..543,695  {  35.67.) 

...10,162  (  o.r/.) 

...11,704  (  0.87.) 

- 8,192  (  0.57.) 

..141,098  (  9.27,) 

1,526,017  (100.07,) 


operations  per  instruction 


4.6 
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segBont  size  profile 

3  . 

4  . 

5  . 

6  . 

7  . 

8  . 

9 . 

12 . 

13 . 

16 . 

total . 


.  . 

.  .1 

( 

0 

0*/.) 

.8 

,192 

( 

24 

3*/.) 

.2 

,183 

( 

6 

5*/.) 

.3 

,477 

( 

10 

3*/.) 

10 

,295 

( 

30 

6*/.) 

.6 

250 

( 

18 

5*/.) 

.  • 

332 

( 

1 

0*/.) 

.  . 

638 

( 

1 

9*/.) 

.  . 

778 

( 

2 

3*/.) 

.1 

624 

( 

4 

8*/.) 

33 

770 

(100 

0*/.) 

average  segment  size 


6.9 


segment  age  profile 

9-16 . 8.509  (  25.8*/.) 

17-32 . 1,747  (  6. 3’/.) 

33-64 . 12,192  (  36.9*/,) 

65-128 . 2,742  ( 

129-256 . 1,524  ( 

257-512 . 3,604  ( 

513-1,024 . 444  ( 

1,025-2,048 . 398  ( 

2,049-4,096 . 576  ( 

4.097-8,192 . 732  ( 

8,193-16,384 . 532  ( 

16.385-32.768 . 2  ( 

32,769-65,536 . 1  ( 

1,048,577-2,097,152 . 1  ( 


8.3*/.) 
4.6*/.) 
10.9*/.) 
1.3*/.) 
1.2*/.) 
i.r/.) 
2. 2*/.) 
1.6*/.) 
0.0*/,) 
0.0*/,) 
0.0%) 


total . 33,004  (100.0*/,) 


last  tick  at  1,132,376 


simulation  time:  19  minutes  5  seconds 


B.2  Shared  Memory  With  Address  Braiding 


12  June  1990  11:54:21am  sm-nc-test  PiSim  version  0.1 


task  type  profile 
ADJUST-BARRIER. . . 

RODE-SETUP . 

PRIIT-COUIT . 

READ . 

READ-ADDRESS . 


4,966  (  12.4*/,) 
...64  (  0.2*/,) 

8,192  (  20.5*/,) 
4,368  (  10.9*/,) 
4,368  (  10.9*/,) 
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REMOTE  -TEST . 

. 268 

( 

0, 

.7*/.) 

REMOTE-WRITE . 

. 261 

( 

0, 

.7*/.) 

REPLY-VALUE . 

. 4,368 

( 

10, 

.9*/.) 

SHARED-MEMORY-SETUP . 

. . 1 

( 

0. 

.0*/.) 

TEST . 

. 1 

( 

0. 

.0*/.) 

TEST-1 . 

. 1 

( 

0, 

.0*/.) 

TEST-2 . 

. 1 

( 

0. 

.0*/.) 

TEST-3 . 

. 1 

( 

0, 

.0*/.) 

TEST-4 . 

. 1 

( 

0. 

.0*/.) 

TEST-5 . 

. 1 

( 

0. 

.0*/.) 

TEST-6 . 

. 1 

( 

0. 

.0*/.) 

TEST-7 . 

. 1 

( 

0. 

.0*/.) 

TEST-LOCATIOH . 

. 4,367 

( 

10. 

.9*/.) 

WRITE . 

. 4,362 

( 

10. 

.9*/.) 

WRITE- ADDRESS . 

. 4,362 

( 

10, 

.9*/.) 

total . 

. 39,944 

( 

100. 

.0*/.) 

task  status  profile 


CALL . 21,291  (  30.4*/.) 

MEW . 18,653  (  26. 7*/.) 

WAITING . 30,031  (  42.9*/,) 

total . 69,975  (100.0*/,) 


task  run  time  profile 

9-16 . 

. 26 

309 

( 

65 

.9*/.) 

17-32 . 

. 4 

897 

( 

12 

.3*/.) 

33-64 . 

. 8 

733 

( 

21 

.9*/.) 

65-128 . 

.  .2 

( 

0. 

.0*/.) 

513-1,024 . 

.  .1 

( 

0. 

.0*/.) 

8,193-16,384 . 

.  .1 

( 

0 

.0*/.) 

262,145-524,288 . 

.  .1 

( 

0. 

.0*/.) 

total . 

. 39 

944 

(100. 

.0*/.) 

task  salt  time  profile 

0 . 

1 . 

2 . 

3-4 . 

5-8 . 

9-16 . 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-512 . 

513-1,024 . 

1,025-2,048 . 

2,049-4,096 . 

4,097-8,192 . 

16,385-32,768 . 

524,289-1,048,576 . 

total . 


25,470  (  63. 8*/.) 
....23  (  0.1*/,) 

. 6  (  0.0*/,) 

- 17  (  0.0*/,) 

_ 25  {  0.1*/,) 

_ 45  (  0.1*/,) 

...121  (  0.3*/,) 

.8,620  (  21.6*/.) 
.4,863  (  12. 2*/.) 
...153  (  0.4*/,) 

. 5  (  0.0*/,) 

- 71  (  0.2*/,) 

. 7  (  0.0*/,) 

...386  (  i.oy.) 
.  .  .131  (  0.3*/,) 

. 1  (  0.0*/,) 

. 1  (  0.0*/,) 

39,944  (100.0*/,) 


instruction  type  profile 
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A-SHIFT . 17,460  (  7.1’/.) 

AD JUST-COUIT . 9,327  (  3 . 8*/. ) 

AID . 26,190  (  10.7*/,) 

AHOTHER-HODE-ID . 3  (  0 . 0‘/. ) 

ATTRIBUTE . 8,737  (  3. 6*/.) 

BRAICH-IOT-ZERO . 25,152  (  10. 3*/.) 

BRAICH-ZERO . 4,367  (  1.8'/,) 

CALL . 42,582  (  17.4'/.) 

CREATE-READ-WRITE-SEGMEIT . 64  (  0.0*/.) 

DESTROY-SEGMEIT . 39,944  (  16. 3y.) 

IIITIALIZE . 1  (  0.0*/,) 

Miius . 65  (  o.oy.) 

MOVE . 8,800  (  3.6y.> 

OR . 8,730  (  3.6*/.) 

PLUS . 8,711  (  3.6*/,) 

PRIIT-USER . 104  (  o.oy.) 

RETURH . 8,736  (  3.6*/,) 

SEID . 18,652  (  7.6*/,) 

TEST-COUIT . 8,744  (  3. 6*/.) 

TIMES . 8,192  (  3.3'/.) 

total . 244,561  (100.0*/.) 


instructions  per  messages  13.1 


instructions  per  task  6.1 


instructions  run  length  5.0 


operation  type  proiile 

A-SHIFT . 17,460  (  1.6'/.) 

AD  JUST-COUIT . 9,327  (  0.9'/,) 

AID . 26,190  (  2.4'/.) 

AIOTHER-IODE-ID . 526  (  0.0'/.) 

ATTRIBUTE . 8,737  (  0.8'/,) 

BRAICH-IOT-ZERO . 25,152  (  2.3'/,) 

BRAICH-ZERO . 4,367  (  0.4*/,) 

CALL . 42,582  (  3.9'/,) 

COMPARE . 13,071  (  1.2*/,) 

CREATE-READ-WRITE-SEGMEIT . 64  (  0.0'/,) 

DESTROY-SEGMEIT . 39,944  (  Z.T/.) 

MIIUS . 65  (  0.0'/,) 

MOD . 16,384  (  1.5'/,) 

lODALS . 8,794  (  0.8'/.) 

■ODE-ID . 9,329  (  0.9'/,) 

OR . 8.730  (  O.sy.) 

PLUS . 8,711  (  0.8y.) 

PRIIT-USER . 104  (  0.0'/,) 

READ . 295,187  (  27. 1'/.) 

RETURI . 4,368  (  0.4%) 

SELF . 431,365  (  39.6'/,) 

SEID . 18,652  (  l.r/.) 

TEST-COUIT . 8,744  (  0.87.) 

TIMES . 8,192  (  0.8X) 

WRITE . 82,683  (  7.6'/,) 
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total 


1,088,628  (100.0*/,) 


oparat’onr  par  instn\otiap 


4.5 


segment  size  profile 


3  . 1  (  0.0*/,) 

4  . 8,192  {  20.6*/.) 

5  . 2  (  0.0*/,) 

6  . 9,326  (  23. 3*/.) 

7  . 13,167  (  32. 9*/.) 

8  . 8,988  (  22.6*/.) 

9  . 268  (  O.r/.) 

1024 . 64  (  0.2*/.) 

total . 40,008  (100.0*/,) 


average  segment  size 


8.0 


segment  age  profile 

9-16 . 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-512 . 

513-1,024 . 

1,025-2,048 . 

2,049-4,096 . 

4,097-8,192 . 

8,193-16,384 . 

16,385-32,768 . 

1,048,577-2,097,152 
total . 


25,490  (  63.8*/,) 
...106  (  0.3*/,) 

...106  (  0.3*/,) 

13,243  (  33.2*/,) 
...395  (  1.0*/,) 

. 7  (  0.0*/,) 

- 70  (  0.2*/,) 

. 8  (  0.0*/,) 

...386  (  1.0*/,) 

...130  (  0.3*/,) 

. 1  (  0.0*/,) 

. 1  (  0.0*/,) 

. 1  (  0.0*/,) 

39,944  (100.0*/,) 


last  tick  at  1,167,698 


simulation  time:  17  minutes  50  seconds 


B.3  Set  Synchronization 


12  June  1990  11:24:28am  set-sync-lOO-lOO  PiSim  version  0.1 


task  type  profile 

ACK . 

ADJUST-BARRIER. . . 

LEAF . 

REPLY-VALUE . 


...126  (  0.5*/,) 

12,600  (  49.3*/,) 
...100  (  0.4y,) 
12,726  (  49.8%) 


179 


SET-SYHC-TEST 

STEM . 

total . 


. 1  (  0.0*/.) 

- 26  (  0.1*/.) 

25,579  (100. Oy.) 


task  status  profile 


lEV . 25,579  (  62. 3*/.) 

WAITIIG . 15,453  (  37.7*/,) 

total . 41,032  (100.0*/,) 


task  run  time  profile 

9-16 . 

2,049-4,096 . 

8,193-16,384 . 

32,769-65,636 . 

total . 


25,452  (  99.6*/.) 

. 1  (  0.0*/,) 

- 26  (  0.1*/,) 

...100  (  0.4*/,) 

25,679  (100.0*/,) 


task  uait  time  profile 

0 . . 

1 . 

2 . 

3-4 . 

5-8 . 

9-16 . 

17-32 . 

33-64 . 

66-128 . 

129-256 . 

257-512 . 

513-1,024 . 

1,025-2,048 . 

131,073-262,144 . 

262,146-624,288 . 

total . 


11 

,207 

( 

43.8%) 

.100 

( 

0.4%) 

.202 

( 

0.8%) 

. .  .1 

( 

0.0%) 

.402 

( 

1.6%) 

.412 

( 

1.6%) 

.106 

( 

0.4%) 

.1 

,119 

( 

4.4*/,) 

.3 

,543 

( 

13.9%) 

.1 

,336 

( 

6.2%) 

.4 

,219 

( 

16.5%) 

.2 

,606 

( 

10.2%) 

•  • 

.200 

( 

0.8%) 

*  , 

.100 

( 

0.4*/,) 

*  * 

..27 

( 

0.1%) 

25 

,579 

(100.0%) 

instruction  type  profile 

ADJUST-COUIT . 

AHOTHER-IODE-ID . 

ATTRIBUTE . 

BRAICH-MOT-MIIUS . 

BRAiCH-MOT-ZERO . 

BRAICH-ZERO . 

CEILIIG . 

CREATE-READ-WRITE-SEGMEHT 

DESTROY-SEGMEIT . 

IIITIALIZE . 

MIIUS . 

MOVE . 

PLUS . 

SEID . 

TEST-COUIT . 

total . 


...15,453  (  1.3*/,) 

. 126  (  0.0*/,) 

. 163  (  0.0*/,) 

. 25  (  0.0*/,) 

..622,960  (  44.9*/,) 
....5,601  (  0.6*/,) 

. 125  (  0.0*/,) 

. 63  (  0.0*/,) 

...26,632  (  2.2y.) 
...12,773  (  1.1*/,) 

..500,250  (  42.9*/,) 
...38,330  (  3.3*/,) 

...12,860  (  i.iy,) 

...26,678  (  2.2%) 

- 5,454  (  0.5%) 

1,166,363  (100.0%) 


instructions  per  messages 


45.6 
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instructions  per  task 


45.6 


instructions  run  length 


28.4 


operation  type  profile 

ADJUST-COUIT . 

AMOTHER-HODE-ID . 

ATTRIBUTE . 

BRAICH-MOT-MIIUS . 

BRAICH-HOT-ZERO . 

BRAICH-ZERO . 

CEILIHG . 

COMPARE . 

CREATE-READ-WRITE-SEGMEHT 

DESTROY-SEGMEIT . 

MIIUS . 

HODE-ID . 

PLUS . 

READ . 

SELF . 

SEHD . 

TEST-COUHT . 

WRITE . 

total . 


...16,453  (  0.3'/,) 

. 126  (  0.0*/,) 

. 163  (  o.oy.) 

. 26  (  0.0*/.) 

..522,960  (  10.9*/,) 

- 6,601  (  o.iy.) 

. 126  (  o.oy.) 

...25,726  (  0.5*/,) 

. B3  (  o.oy.) 

...25,632  (  0.6*/.) 

..500,250  (  10.4*/.) 

. 126  {  0.0*/.) 

...12,860  (  0.3y.) 
1,271,976  (  26. 6*/.) 
1,832,068  (  38.2*/.) 
...25,678  (  0.6*/,) 

....5,464  (  0,1*/.) 

..551,781  (  11.5*/,) 
4,796,926  (100.0*/,) 


operations  per  instruction 


4.1 


segment  size  profile 


1 . 1  (  0.0*/.) 

4 . 6  (  0.0*/.) 

6 . 47  (  0.2*/.) 

6 . 25,462  (  99.3*/.) 

9 . 101  (  0,4*/.) 

14 . 26  (  0.1*/.) 

total . 26,632  (100.0*/,) 


average  segment  size 


6.0 


segment  age  profile 

9-16 . 

17-32 . 

33-64 . 

66-128 . 

129-266 . 

257-512 . 

613-1,024 . 

1,026-2,048 . 

262,146-624,288 - 

total . 


11 

.611 

( 

46.3*/.) 

.  . 

.818 

( 

3.2*/.) 

.  . 

710 

( 

2.8*/.) 

.3 

,646 

( 

14.2*/.) 

.1 

,543 

( 

6.0*/.) 

.4 

,217 

( 

16.6*/.) 

.2 

,708 

( 

10.6*/.) 

•  . 

200 

( 

o.8y.) 

•  • 

180 

( 

o.Ty.) 

26 

632 

(100.0*/.) 

last  tick  at  301,010 


simulation  time:  22  minutes  2  seconds 
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B.4  Object  Name  Translation 


12  June  1990  ll:20:55aiii  X-LATE-TEST  PiSia  version  0.1 


task  type  profile 

ADJUST-BARRIER . 204  (  7.r/.) 

AGEIT-IHSERT . 102  (  3. 9*/.) 

AGEIT-MATCH . 202  (  7.6*/.) 

CLIEIT-COKPARE . 200  (  7.6%) 

CLIEHT-IHSERT . 102  (  3. 9*/.) 

CLIEIT-MATCH . 2  (  0.1*/,) 

CLIEIT-REMATCH . . 1  (  0 . 0*/.) 

CLOSEST-AGEHT . 625  (  23.6*/,) 

COMPARE-ALL-CLIEITS . 2  (  0.1*/,) 

CREATE-AGEIT . 25  (  0.9*/,) 

CREATE-CLIEIT . 25  (  0.9*/.) 

CREATE-MAP . 10  (  0.4*/,) 

IIITIALIZE-TRAMSLATIOHS . 1  (  0.0%) 

IISERT . 102  (  3.9%) 

IISERT-II-MAP . 102  (  3.9%) 

MAP-LOCATIOI-UPDATE . 250  (  9.4%) 

MATCH . 203  (  7.7%) 

MATCH-II-MAP . 202  (  7.6%) 

REMATCH . 1  (  0.0%) 

REPLY-VALUE . 285  (  10.8%) 

TEST-1 . 1  (  0.0%) 

TEST-2 . 1  (  0.0%) 

X-UTE-TEST . 1  (  0.0%) 

total . 2,649  (100.0%) 


task  status  profile 

CALL . 308  (  9.0%) 

lEH . 2,341  (  68.7%) 

WAITIIG . 757  (  22.2%) 

total . 3,406  (100.0%) 


task  run  time  profile 

9-16 . 793  (  29.9%) 

17-32 . 768  (  28.6%) 

33-64 . 1,093  (  41.3%) 

66-128 . 1  (  0.0%) 

613-1,024 . 1  (  0.0%) 

2,049-4,096 . 3  (  0.1%) 

total . 2,649  (100.0%) 


task  wait  tiae  profile 

0 . 1,393  (  62.6%) 
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1  . 9  (  o.sy.) 

2  . 63  (  2. 4’/.) 

3-4 . 15  (  o.ey.) 

5-8 . 15  (  0.6*/.) 

9-16 . 59  (  2. 2’/.) 

17-32 . 79  (  3.0y.) 

33-64 . 94  (  3.5*/.) 

65-128 . 399  (  15.1*/.) 

129-256 . 85  (  3. 2*/.) 

257-512 . 120  (  4.5*/.) 

513-1,024 . 56  (  2.1*/.) 

1,025-2,048 . 4  (  0.2y.) 

2,049-4,096 . 14  (  0.5*/,) 

4,097-8,192 . 219  (  8.3*/.) 

8,193-16,384 . 24  (  0.9y.) 

16,385-32,768 . 1  (  O.O*/.) 

total . 2,649  (100.0*/,) 


instruction  type  profile 

ADJUST-COUNT . 310  (  2.2*/.) 

ANOTHER-NODE-ID . 26  (  0.2*/,) 

ATTRIBUTE . 358  (  2. 5*/.) 

BRAMCH-HOT-HINUS . 600  (  4.3*/,) 

BRAHCH-NOT-ZERO . 578  (  4.1*/.) 

BRANCH-ZERO . 2,318  (  16. 5*/.) 

CALL . 614  (  4. 4*/.) 

CREATE- ASSOCIATIVE-SEGMENT . 25  (  0.2*/,) 

CREATE-READ-WRITE-SEGMENT . 102  (  0.7*/,) 

DESTROY-SEGMENT . 2,649  (  18. 9*/.) 

DISTANCE . 61  (  0.4*/,) 

INITIALIZE . 5  (  0.0*/,) 

INSERT . 506  (  3.6*/,) 

MATCH . 203  (  1.4’/,) 

MOD . 604  (  4.3*/,) 

MOVE . 895  (  6.4’/,) 

NODE-ID . 62  (  0.4*/,) 

PLUS . 1,235  (  8.8*/,) 

PRINT-USER . 3  (  0.0*/,) 

REMOVE . 1  (  0.0*/,) 

RETURN . 205  (  1.5*/,) 

SEND . 2,479  (  17.6*/,) 

TEST-COUNT . 210  (  1.5*/,) 

total . 14,049  (100.0*/,) 


instructions  per  messages  6. 


instructions  per  task  5.3 


instructions  run  length  4.5 


operation  type  profile 

ADJUST-COUNT . 310  (  0.4*/,) 

ANOTHER-NODE-ID . 61  (  0.1*/,) 

ATTRIBUTE . 358  (  0.4*/,) 


183 


BRAHCH-HOT-MIHUS . 600  (  0.7*/,) 

BRAHCH-NOT-ZERO . 578  (  0.7*/,) 

BRAHCH-ZERO . 2,315  (  2.r/.) 

CALL . 616  (  O.r/.) 

COMPARE . 2,891  (  3.3'/.) 

CREATE-ASSOCIATIVE-SEGHENT . 35  (  0.0*/,) 

CREATE-READ-WRITE-SEGMEHT . 102  (  0.1*/,) 

DESTROY-SEGMEHT . 2,649  (  3.1*/.) 

DISTAHCE . 661  (  0.8*/,) 

IMSERT . 304  (  0.4*/,) 

MATCH . 405  (  0.5'/.) 

MOD . 604  (  O.r/,) 

HODE-ID . 1,221  (  1.4*/,) 

PLUS . 1,235  (  1.4*/,) 

PRIHT-USER . 3  (  0.0*/,) 

READ . 32,193  (  37. 1*/.) 

REMOVE . 1  (  0.0*/,) 

RETURN . 204  (  0.2*/,) 

SELF . 33,182  (  38.2*/.) 

SEND . 2,340  (  2. 7*/.) 

TEST-COUNT . 210  (  0.2*/.) 

TIMES . 301  (  0.3*/.) 

WRITE . 3,421  (  3.9*/.) 

total . 86,800  (100.0*/,) 


operations  per  instruction  6.2 


segment  size  profile 

3  . 25  (  0.9*/,) 

4  . 25  (  0.9*/.) 

5  . 1  (  0.0*/,) 

6  . 702  (  25.2*/,) 

7  . 302  (  10.8*/.) 

8  . 455  (  16.3*/,) 

9  . 437  (  15.7*/,) 

10  . 827  (  29.7*/.) 

25 . 12  (  0.4*/.) 

total . 2,786  (100.0*/.) 


average  segment  size  8.1 


segment  iige  profile 

9-16 . 604  (  22.8*/,) 

17-32 . 261  (  9.9*/,) 

33-64 . 786  (  29.7*/) 

65-128 . 258  (  9.7*/,) 

129-256 . 293  (  11.1*/,) 

257-512 . 113  (  4.3*/.) 

513-1,024 . 71  (  2.r/,) 

1,026-2,048 . 3  (  0.1*/,) 

2,049-4,096 . 15  (  0.6*/,) 

4,097-8,192 . 218  (  8.2*/,) 

8,193-16,384 . 26  (  1.0*/,) 

16,385-32,768 . 1  (  0.0*/.) 


184 


total 


2,649  (lOO.oy.) 


last  tick  at  18,440 


simulation  time;  44  seconds 


B.5  Non-Resident  Handlers 


12  June  1990  11:22; 41am  NR- HANDLER-TEST  PiSim  version  0.1 


task  type  prolile 

ADD-HANDLER . 1  (  0.07.) 

ADJUST-BARRIER . 67  (  2.2*/.) 

AGENT-INSERT . 1  (  0.0*/,) 

AGENT-MATCH . 64  (  2.1*/,) 

CLOSEST-AGENT . 1,600  (  53.0*/,) 

CREATE- AGENT . 25  (  0 . 8*/.) 

CREATE-CLIENT . 64  (  2.1*/,) 

CREATE-MAP . 10  (  0.3'/,) 

CREATE-NODE-MAPS . 64  (  2.1'/,) 

DISPATCH . 128  (  4.2’/.) 

FETCH-HANDLER . 64  (  2.1*/,) 

INITIALIZE-NR-HANDLERS . 1  (  0.0*/.) 

INITIALIZE-TRANSLATIONS . 1  (  0.0*/,) 

INSERT . 1  (  0.0'/.) 

INSERT-IN-MAP . 1  (  0.0*/,) 

LOOKUP-HANDLER . 97  (  3. 2*/.) 

MAP-LOCATION-UPDATE . 250  (  8.3*/,) 

MATCH . 64  (  2.1*/,) 

MATCH-IN-MAP . 64  (  2.1*/,) 

NR-HANDLER-TEST . 1  (  0.0*/,) 

REPLY-HANDLER . 64  (  2.1*/,) 

REPLY- VALUE . 261  (  8.6*/.) 

TEST-NR-HANDLER . 128  (  4. 2*/.) 

total . 3,021  (100.0*/.) 


task  status  profile 


CALL . 290  (  7.9*/.) 

NEW . 2,731  (  74.9*/.) 

WAITING . 627  (  17.2*/.) 

total . 3.648  (100.0*/.) 


task  run  time  profile 


9-16 . 521  (  17.2*/.) 

17-32 . 247  (  8.2*/.) 

33-64 . 2,058  (  68.1*/.) 


185 


65-128 . 65  (  2.2*/.) 

129-256 . 128  (  4. 2’/.) 

1,025-2,048 . 2  (  0.1*/,) 

total . 3.021  (100.0*/,) 


task  «ait  time  proiile 

0 . 1.720  (  56.9*/.) 

2 . 120  (  4.0*/,) 

3-4 . 11  (  0.4*/,) 

5-8 . 15  (  0.5*/,) 

9-16 . 141  (  4.7*/,) 

17-32 . 69  (  2. 3*/.) 

33-64 . 142  {  4.r/.) 

65-128 . 112  (  3.r/.) 

129-256 . 115  (  3. 8*/.) 

257-512 . 76  (  2.5*/,) 

513-1,024 . 158  (  5.2*/,) 

1,025-2,048 . 112  (  3.7%) 

2,049-4,096  . 219  (  7.2%) 

4,097-8,192 . 11  (  0.4%) 

total . 3,021  (100.0%) 


instruction  type  profile 

ADJUST-COUHT . 71  (  0.3%) 

ANOTHER-MODE-ID . 1  (  0.0%) 

ATTRIBUTE . 195  (  0.8%) 

BRANCH-MINUS . 960  (  4.0%) 

BRANCH-NOT-MINUS . 1,536  (  6.4%) 

BRANCH-NOT-ZERO . 1,043  (  4.4%) 

BRANCH-PLUS . 1  (  0.0%) 

BRANCH-ZERO . 4,998  (  20.9%) 

CALL . 324  (  1.4%) 

CALL-SEGMENT . 256  (  1.1%) 

CREATE- ASSOCIATIVE-SEGMENT . 256  (  1.1%) 

CREATE-READ-WRITE-SEGMENT . 396  (  1.7%) 

DESTROY-SEGMENT . 3,149  (  13.2%) 

DISTANCE . 187  (  0.8%) 

INITIALIZE . 453  (  1.9%) 

INSERT . 258  (  1.1%) 

MATCH . 514  (  2.1%) 

MINUS . 193  (  0.8%) 

MOD . 65  (  0.3%) 

MOVE . 1,832  (  7.r/.) 

NODE-ID . 188  (  0.8%) 

PLUS . 3,743  (  15.7%) 

READ . 194  (  0.8%) 

REMOVE . 64  (  0.3%) 

RETURN . 64  (  0.3%) 

SEND . 2,774  (  11.6%) 

SEND-SEGMEKT . 64  (  0.3%) 

TEST-COUNT . 8  (  0.0%) 

TIMES . 128  (  0.5%) 

total . 23,915  (100.0%) 
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instructions  per  messages 


8.8 


instructions  per  task  7.9 


instructions  run  length  7.1 


operation  type  profile 

ADJUST-COUHT . 71  (  O.l*/,) 

ANOTHER-KODE-ID . 38  (  0.0*/,) 

ATTRIBUTE . 195  (  0.1*/.) 

BRAMCH-MIHUS . 960  (  0.7*/,) 

BRAMCH-NOT-MINUS . 1,536  (  1.1*/,) 

BRAHCH-HOT-ZERO . 1,043  (  0.8*/,) 

BRAICH-PLUS . 1  (  0.0*/,) 

BRAHCH-ZERO . 4,998  (  3.7*/,) 

CALL . 324  (  0.2*/,) 

CALL-SEGMENT . 266  (  0.2*/,) 

COMPARE . 6,296  (  4.6*/,) 

CREATE-ASSOCIATIVE-SEGMEHT . 266  (  0.2*/.) 

CREATE-READ-WRITE-SEGMENT . 396  (  0.3*/,) 

DESTROY-SEGMENT . 3,149  (  2.3*/,) 

DISTANCE . 1,723  (  1.3*/,) 

INSERT . 194  (  0.1*/,) 

MATCH . 578  (  0.4*/,) 

MINUS . 321  (  0.2*/,) 

MOD . 66  (  0.0*/,) 

NODALS . 1,093  (  0.8*/,) 

NODE-ID . 2,337  (  l.T/,) 

PLUS . 3,743  (  2.7*/,) 

PROBE . 640  (  0.5*/,) 

READ . 45,674  (  33.6*/,) 

REMOVE . 64  (  0.0*/,) 

RETURN . 64  (  0.0*/,) 

SELF . 49,656  (  36.4*/,) 

SEND . 2,666  (  2.0*/,) 

SEND-SEGMENT . 64  (  0.0*/,) 

TEST-COUNT . 8  (  0.0*/,) 

TIMES . 128  (  0.1*/,) 

WRITE . 7,956  (  5.8*/,) 

total . 136,602  (100.0*/,) 


operations  per  instruction 


5.7 


segment  size  profile 


1 . 64  (  1.7*/,) 

3  . 129  (  3.5*/,) 

4  . 163  (  4.2*/,) 

6 . 193  (  5.27.) 

6  . 499  {  13.5*/,) 

7  . 65  (  1.8*/,) 

8  . 442  (  12.0*/,) 

9  . 258  (  7.0*/,) 

10  . 1,741  (  47.3*/,) 

12 . 1  (  0.0*/.) 


187 


14. .  . 

25. .  . 
total 


..128  (  3.5*/.) 

...10  (  0.3*/,) 

3.683  (100.0*/,) 


average  segment  size 


8.4 


segment  age  profile 

9-16 . 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-512 . 

513-1,024 . 

1.025-2,048 . 

2,049-4,096 . 

4,097-8,192 . 

8,193-16,384 . 

total . 


...301  ( 

9 

6*/.) 

...204  ( 

6 

5*/.) 

.1,455  ( 

46 

2*/.) 

...301  ( 

9 

6*/.) 

...285  ( 

9 

1*/.) 

....75  ( 

2 

4*/.) 

..  .180  ( 

5 

7*/.) 

...113  ( 

3 

6*/.) 

. .  .216  ( 

6 

9*/.) 

....18  ( 

0 

6*/.) 

. 1  ( 

0 

0*/.) 

..3,149  (100 

0*/.) 

last  tick  at  12.444 


simulation  time:  50  seconds 


B.6  N-Body  Simulation  (10  Bodies) 


12  June  1990  1:12:04pm  n-body-10-1000  PiSim  version  0.1 


task  type  profile 

ADD-ACCELERATIOK . 90,000  (  50. 0*/.) 

IMITIALIZE-SYSTEH . 1  (  0.0*/,) 

REPLY-VALUE . 10  (  0.0*/,) 

START-BODY . 10  (  0.0*/,) 

START-INTERACTION . 45  (  0.0*/,) 

UPDATE-POSITION . 90,000  (  50. 0*/.) 

WRITE-INTERACTION-LOCATION . 90  (  0.0'/,) 

total . 180,156  (100.0*/,) 


task  status  profile 


NEW . 180,156  (  74.7*/,) 

WAITING . 61,086  (  25.3*/,) 

total . 241,242  (100. 0*/.) 


task  run  time  profile 


9-16 . 10  (  0.0*/.) 

17-32 . 180,090  (100.0*/,) 


188 


2,049-4,096 _ 

65,537-131,072. 
131,073-262,144 
262,145-524,288 
total . 


. 1  (  0.0*/,) 

. 39  (  0.0%) 

. 6  (  0.0%) 

. 10  (  0.0%) 

180,156  (100.0%) 


task  salt  time  profile 

0 . 

1 . 

2 . 

3-4 . 

5-8 . 

9-16 . 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-512 . 

524,289-1,048,576 . 

total . 


101,644  ( 

56.4%) 

. 4  ( 

0.0%) 

..1,994  ( 

1 . 1%) 

....385  ( 

0.2%) 

..4,017  ( 

2.2%) 

. .1,822  ( 

1.0%) 

..5,218  ( 

2.9%) 

.14,020  ( 

7.8%) 

.17,019  ( 

9.4%) 

.26,986  ( 

15.0%) 

..6,992  ( 

3.9%) 

. 55  ( 

0.0%) 

180,156  (100.0%) 

instruction  type  profile 


ADJUST-COUNT . 

. 100 

,000  ( 

4, 

.9%) 

ANOTHER-NODE-ID . 

. 

.  .10  ( 

0, 

.0%) 

ATTRIBUTE . 

. 

.190  ( 

0, 

.0%) 

BRAMCH-NOT-MIHUS . 

. 

...1  ( 

0, 

.0%) 

BRAHCH-NOT-ZERO . 

. 246 

,264  ( 

12, 

.  1%) 

CREATE-READ-WRITE-SEGMENT _ 

,.22  ( 

0 

.0%) 

DESTROY-SEGMENT . 

. 180 

,178  ( 

8 

.9%) 

EXPONENT . 

. 45 

,000  ( 

2, 

.2%) 

INITIALIZE . 

. 30 

,078  ( 

1. 

.5%) 

MINUS . 

. 141 

,063  ( 

7, 

.0%) 

MOVE . 

. 180 

,380  ( 

8. 

.9%) 

PLUS . 

. 410 

,163  ( 

20, 

.2%) 

PRINT-USER . 

. 

,220  ( 

0. 

0%) 

RANDOM . 

. 

.20  ( 

0, 

0%) 

SEND . 

. 180 

,224  ( 

8. 

9%) 

TEST-COUNT . 

. 20 

,000  ( 

1 , 

0%) 

TIMES . 

. 495 

,090  ( 

24, 

,4%) 

total . 

_ 2,028 

,903  (100. 

0%) 

instructions 

per 

messages 

11.3 

instructions 

per 

task 

11.3 

instructions 

run 

length 

8.4 

operation  type  profile 


ADJUST-COUNT . 

. 100,000 

( 

0.8%) 

ANOTHER-NODE-ID . 

. 56 

( 

0.0%) 

ATTRIBUTE . 

. - _ 190 

( 

0.0%) 

BRANCH-NOT-MINUS . 

. 1 

( 

0.0%) 

BRANCH-NOT-ZERO . 

. 246,264 

( 

1.9%) 

COMPARE . 

. 235,064 

( 

1.8%) 

189 


CREATE-READ-WRITE-SEGMEHT . 22  (  0 . 0*/. ) 

DESTROY-SEGMEHT . 180, i 78  (  1.4’/.) 

EXPOIEHT . 45,000  (  0.47.) 

MIIUS . 90,047  (  0.77.) 

MOD . 1.200  (  0.07,) 

MODE-ID . 100  (  0.07.) 

PLUS . 410,163  (  3.27.) 

PRIIT-USER . 220  (  0.07.) 

RAMDOM . 20  (  0.07.) 

READ . 4,376,106  (  34.47.) 

SELF . 5,077.747  (  40.07.) 

SEMD . 180.165  (  1.47.) 

TEST-COUMT . 20,000  (  0.27.) 

TIMES . 496,090  (  3.97.) 

WRITE . 1,250,809  (  9.87.) 

total . 12,708,432  (100.07,) 


operations  per  instruction 


segment  size  profile 

6  . 90,010  (  50.07.) 

7  . 90,000  (  50.07.) 

9  . 110  (  0.17.) 

10  . 2  (  0.07.) 

11  . 1  (  0.07.) 

19 . 10  (  0.07.) 

23 . 45  (  0.07.) 

total . 180,178  (100.07,) 


average  segment  size 


segment  age  profile 

17-32 . 108,053  (  60.07.) 

33-64 . 8,045  (  4.57.) 

65-128 . 25,020  (  13.97.) 

129-256 . 25,997  (  14.47.) 

257-512 . 12,986  (  7.27.) 

2,049-4,096 . 3  (  0.07.) 

524,289-1,048,576 . 75  (  0.07.) 

total . 180,178  (100.07.) 


last  tick  at  851,237 


simulation  time:  1  hour  10  minutes  57  seconds 


B.7  N-Body  Simulation  (100  Bodies) 


190 


12  June  1990  3:16:48pm  n-body-100-10  PiSim  version  0.1 


task  type  profile 

ADD-ACCELERATIOH . 99,000  (  46. 5*/.) 

IMITIALIZE-SYSTEM . 1  (  0.0’/,) 

REPLY-VALUE . 100  (  0.0*/.) 

START-BODY . 100  (  0.0*/,) 

START-INTERACTIOH . 4.9S0  (  2.3*/.) 

UPDATE-POSITIOH . 99,000  (  46.5*/.) 

WRITE-IHTERACTIDH-LOCATION . 9,900  (  4.6*/.) 

total . 213,051  (100. 0*/.) 


task  status  profile 


HEW . 213,051  (  74.9*/,) 

WAITIHG . 71,324  (  25. 1*/.) 

total . 284,375  (100.0*/,) 


task  run  time  profile 

9-16 . 

17-32 . 

1,025-2,048 . 

16,385-32,768 . 

131,073-262,144 . 

total . 


- 100  (  0.0*/,) 

207,900  (  97.6*/,) 
..4,950  (  2.3*/,) 

- 100  (  0.0*/,) 

. 1  (  0.0*/,) 

213,051  (100.0*/,) 


task  vait  time  profile 

0 . 

1 . 

2 . 

3-4 . 

5-8 . 

9-16 . 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-512 . 

513-1,024 . 

1,025-2,048 . 

2,049-4,096 . 

4,097-8,192 . 

8,193-16,384 . 

16,385-32,768 . 

32,769-65,536 . 

66,537-131,072 . 

131,073-262,144 . 

262,145-624,288 . 

524,289-1,048,676 . 

total . 


.36,647  (  17.2*/,) 
....196  (  0.1*/,) 

- 174  (  0.1*/,) 

- 342  (  0.2*/,) 

- 956  (  0. ■»*/,) 

..2,255  (  1.1*/,) 

..2,669  (  1.2*/,) 

..3,931  (  1.8*/,) 

..7,697  (  3.6*/,) 

..6,706  (  3.1*4) 

..6,717  (  3.2*/,) 

.12,629  (  5.9*/,) 

.39,767  (  18.7*/,) 
.48,620  (  22.8*/,) 
.32,736  (  15.4*/,) 
..5,883  (  2.8*/,) 

. 24  (  0.0*/,) 

. 73  (  0.0*/,) 

- 132  (  O.l*/,) 

. B7  (  0.0%) 

..4,979  (  2.3%) 

. 71  (  0.0%) 

213,051  (100.0%) 


instruction  type  profile 

ADJUST-COUIT . 100,000  (  4.6%) 

AHOTHER-IODE-ID . 100  (  0.0%) 
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ATTRIBUTE . 

BRAICH-HOT-MIHUS . 

BRAICH-HOT-ZERO . 

CREATE-READ-WRITE-SEGMEHT 

DESTROY-SEGMEHT . 

EXPOIEIT . 

IHITIALIZE . 

MIIUS . 

MOVE . 

PLUS . 

PRIIT-USER . 

RAIDOM . 

SEHD . 

TEST-COUMT . 

TIMES . 

total . 


...19,900  (  0.9*/.) 

. 1  (  o.oy.) 

..254,661  (  11. 3'/,) 

. 202  (  o.oy.) 

..213,253  (  9.6*/.) 

...49,500  (  2.2’/,) 

_ 8,253  (  0.4*/,) 

..170,765  (  7. 6*/.) 

..237,800  (  10.6*/,) 
..416,148  (  18.5*/,) 

. 2  (  0.0*/,) 

. 200  (  0.0*/,) 

..216,560  (  9.r/,) 

_ 2,000  (  0.1*/,) 

..554,400  (  24.7*/,) 
2,243,745  (100. 0*/.) 


instructions 

per 

tnessa2Ci. 

10. 

,5 

instructions 

per 

task 

10, 

.5 

instructions 

run 

length 

7, 

.9 

operation  type  profile 


ADJUST-COUHT . 

100,000  ( 

0. 

.7%) 

AIOTHER-HODE-ID . 

..5,052  ( 

0, 

.0%) 

ATTRIBUTE . 

.19,900  ( 

0, 

.  1%) 

BRAMCH-MOT-MIIIUS . 

. 1  ( 

0 

.0%) 

BRAMCH-HOT-ZERO . 

254,661  ( 

1 

.8%) 

COMPARE . 

253,649  ( 

1. 

.8%) 

CREATE-READ-WRITE-SEGMEHT _ 

....202  ( 

0, 

.0%) 

DESTROY-SEGMEHT . 

213,253  ( 

1. 

.5%) 

EXPOHEHT . 

.49,500  ( 

0. 

.3%) 

Minus . 

103,952  ( 

0, 

.7%) 

MOD . 

. 12  ( 

0, 

.0%) 

HODE-ID . 

.10,000  ( 

0, 

.  1%) 

PLUS . 

416,148  ( 

2. 

.9%) 

PRIHT-USER . 

. 2  ( 

0, 

.0%) 

RAHDOM . 

....200  ( 

0. 

.0%) 

READ . 

_ 4 

,956,243  ( 

34 

.9%) 

SELF . 

....5 

,685,447  ( 

40, 

.0%) 

SEHD . 

. . 

.213,050  ( 

1. 

.5%) 

TEST-COUHT . 

•  •  .  .  .  c 

...2,000  ( 

0. 

0%) 

TIMES . 

. . 

.554,400  ( 

3. 

9%) 

WRITE . 

_ 1 

,370,554  ( 

9. 

6%) 

total . 

. . .14, 

,208,226  (100. 

.0%) 

operations  per  instruction 


6.3 


segment  size  profile 


6  . 99,100  (  46. 5*/.) 

7  . 99,000  (  46.4%) 

9 . 9,900  (  4.6%) 

11 . 1  (  0.0%) 


192 


19 . 100  (  O.O*/,) 

23 . 4,980  (  2. 3*/.) 

99  . 200  (  0.1*/.) 

100  . 2  (  o.oy.) 

total . 213,253  (100.0*/,) 


average  segment  size 


7.1 


segment  age  profile 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-512 . 

513-1,024 . 

1,025-2,048 . 

2,049-4,096 . 

4,097-8,192 . 

8,193-16,384 . 

16,385-32,768 . 

32,769-65,536 . 

65,537-131,072 . 

131,073-262,144 - 

262,145-524,288 - 

524,289-1,048,576. . 
total . 


.  .39 

430 

( 

18 

5*/.) 

. .  .5 

300 

( 

2 

5*/.) 

. .  .7 

,582 

( 

3 

654) 

. .  .8 

,461 

( 

4 

0*/.) 

. .  .7 

057 

( 

3 

3*/.) 

.  .12 

,275 

( 

5 

8*/.) 

.  .39 

,674 

( 

18 

6*/.) 

.  .48 

985 

( 

23 

0*/.) 

.  .33 

014 

( 

15 

5*/.) 

. .  .5 

,935 

v. 

2 

8*/.) 

.... 

.24 

( 

0 

0*/.) 

.... 

.73 

( 

0 

054) 

.... 

132 

( 

0 

1*/.) 

.... 

.60 

( 

0 

0*/.) 

. .  .4 

871 

( 

2 

3*/.) 

.... 

379 

( 

0 

2*/.) 

.213 

253 

(100 

0*/.) 

last  tick  at  359,691 


simulation  time;  2  hours  27  minutes  34  seconds 


B.8  Relaxation 


12  June  1990  0:56 .•52pm  relax-100  PiSim  version  0.1 


task  type  profile 

REUX . 

REPLY- VALUE . 

START-ELEMEIT - 

UPDATE-TEMP . 

total . 


. 1  (  0.054) 

...100  (  0.354) 
...100  (  0.354) 
36,000  (  99.454) 
36,201  (100.0)4) 


task  status  pr^'flle 


lEW . 36,201  (  62.1)4) 

WAITIIG . 22,110  (  37.954) 

total . 58,311  (100. 0*/.) 


193 


task  run  time  profile 

9-16 . 100  (  o.sy.) 

17-32 . 35,965  (  99.3'/,) 

33-64 . 15  (  0.0*/,) 

65-128 . 17  (  0.0’/.) 

129-256 . 3  (  O.O*/,) 

4,097-8,192 . 24  (  0.1*/,) 

8,193-16,384 . 77  (  0.2*/,) 

total . 36,201  {100. 0*/.) 


task  vait  time  profile 

0 . 7,923  (  21. 9*/.) 

1  . 190  (  0.5*/) 

2  . 295  (  0.8*/,) 

3-4 . 517  (  1.4*/,) 

E-8 . 1,324  (  3.7’/,) 

9-16 . 2,394  (  6.6*/,) 

17-32 . 3,122  (  8. 6*/.) 

33-64 .  4,181  (  11.5*/,) 

65-128 . 5,896  (  16.3'/,) 

129-256 . 6,379  (  17.6*/,) 

257-512 . 3,795  (  10. 5*/.) 

513-1,024 . 57  (  0.2*/,) 

1,025-2,048 . 27  (  0.1*/,) 

16,385-32,768 . 100  (  0.3*/,) 

32,769-65,536 . 1  (  O.O’/,) 

total . 36,201  (100.0%) 


instruction  type  profile 

ATTRIBUTE . 440  (  0.1%) 

BRAICH-MIIUS . 10,200  (  3.3%) 

BRASCH-HOT-MIHUS . 110  (  0.0%) 

BRAICH-HOT-ZERO . 36,447  (  11.9%) 

BRAICH-ZERO . 40,247  (  13.2%) 

CREATE-ASSOCIATIVE-SEGMEHT . 64  (  0.0%) 

CREATE-READ-WRITE-SEGMEHT . 1  (  0.0%) 

DESTROY-SEGMEHT . 36,202  (  11.8%) 

DIVIDE . 10,000  (  3.3%) 

IHITIALIZE . 281  (  0.1%) 

IISERT . 100  (  0.0%) 

MATCH . 36,201  (  11.8%) 

MIIUS . 211  (  0.1%) 

MOD . 500  (  0.2%) 

MOVE . 44,432  (  14.5%) 

PLUS . 53,975  (  17.6%) 

PRIIT-USER . 113  (  0.0%) 

SEID . 36,200  (  11.8%) 

SUSPEID . 147  (  0.0%) 

TIMES . 2  (  0.0%) 

total . 305,873  (100.0%) 


instructions  per  messages  8.4 


194 


instructions  per  task 


8.4 


instructions  run  length 


5.2 


operation  type  profile 

ATTRIBUTE . 

BRAICH-MIMUS . 

BRAICH-HOT-MIHUS . 

BRAMCH-MOT-ZERO . 

BRAICH-Z5R0 . 

COMPARE . 

CREATE-ASSOCIATIVE-SEGMEIT 
CREATE- READ-WRITE-SEGMERT . 

DESTROY-SEGMEIT . 

DIVIDE . 

INSERT . 

MATCH . 

Minus . 

MOD . 

HODALS . 

KODE-ID . 

PLUS . 

PRIHT-USER . 

PROBE . 

READ . 

SELF . 

SEID . 

SUSPEND . 

TIMES . 

WRITE . 

total . 


. 440  (  0.0*/,) 

...10,200  (  0.6*/.) 

. 110  {  o.oy.) 

...36,447  (  2.2*/,) 

...40,247  (  2.5'/.) 

...86,547  (  5.3*/.) 

- 64  (  0.0*/,) 

. 1  (  0.0*/,) 

...36,202  (  2.2*/.) 

...10.000  (  0.6*/.) 

. 100  (  0.0*/,) 

...36,147  (  2. 2*/.) 

. 411  (  0.0*/.) 

. 800  (  0.0*/,) 

...36,465  (  2. 2*/.) 

. 100  (  0.0*/,) 

...40,400  (  2. 5*/.) 

. Ill  (  0.0*/.) 

. 100  (  0,0*/.) 

..512,826  (  31.5*/,) 
..614,335  (  37.7*/.) 
...36,200  (  2.2*/.) 

. 147  (  0,0*/.) 

. 112  (  0.0*/.) 

..132,038  (  8.1*/.) 

1,630,550  (100.0*/,) 


operations  per  instruction 


5.3 


segment  size  profile 


2 . 64  (  0.2*/.) 

6  . 100  (  0.3*/,) 

7  . 36,000  (  99.3*/.) 

8  . 1  (  0.0*/.) 

21 . 100  (  0.3*/,) 

100 . 1  (  0.0*/.) 

total . 36,266  (100.0*/,) 


average  segment  size 


7.0 


segment  age  profile 

9-16 . 

17-32 . 

33-64 . 

65-128 . 

129-256 . 

257-612 . 

613-1,024 . 

1,026-2,048 . 


. 4  (  0.0*/,) 

..11,427  (  31.6*/,) 
...6,991  (  16.6*/,) 
...6,621  (  18.0*/,) 
,..7,015  (  19.47.) 
...6,046  (  13.9*/,) 

. 66  (  0.2%) 

. 32  (  0.1%) 
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32.769-65,536 . 102  (  0.3%) 

total . 36,202  (100.0%) 


last  tick  at  38,299 


simulation  time:  10  minutes  35  seconds 
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